Abstract Forms are our gates to the web. They enable us
to access the deep content of web sites. Automatic form
understanding provides applications, ranging from crawlers
over meta-search engines to service integrators, with a key
to this content. Yet, it has received little attention other than
as a component in specific applications such as crawlers or
meta-search engines. No comprehensive approach to form
understanding exists, let alone one that produces rich models
for semantic services or integration with linked open data.

In this paper, we present OPAL, the first comprehensive 
approach to form understanding and integration. We identify 
form labeling and form interpretation as the two main tasks 
involved in form understanding. On both problems OPAL 
pushes the state of the art: For form labeling, it combines 
features from the text, structure, and visual rendering of a 
web page. In extensive experiments on the ICQ and TEL-8
benchmarks and a set of 200 modern web forms OPAL out- 
performs previous approaches for form labeling by a signif- 
icant margin. For form interpretation, OPAL uses a schema 
(or ontology) of forms in a given domain. Thanks to this do- 
main schema, it is able to produce nearly perfect (> 97% ac- 
curacy in the evaluation domains) form interpretations. Yet, 
the effort to produce a domain schema is very low, as we pro- 
vide a Datalog-based template language that eases the spec- 
ification of such schemata and a methodology for deriving 
a domain schema largely automatically from an existing domain
ontology. We demonstrate the value of OPAL's form
interpretations through a light-weight form integration system
that successfully translates and distributes master queries to
hundreds of forms with no error, yet is implemented with
only a handful of translation rules.
1 Introduction

Unlocking the vast amount of data in the deep web for 
automatic processing has been a central part of "web as 
a database" visions in the past. The web offers unprece- 
dented choice and variety of products, but we lack tools 
to make this wealth of offers easily manageable. Say you
are looking for a flat. Aren't you tired of filling registration
forms with your search criteria on the websites of hundreds
of local agencies? Do you fear missing the site with the very
best offer? Wouldn't you wish to automate these tiresome
tasks? Unlocking this data for automatic processing requires
two keys: a key that allows us through the human-centric, 
scripted form interfaces of the web and a key to identify 
offers among all the other data on the web. In this paper, 
we focus on the former: A key to web forms, the gates to 
the deep web. Since these gates are designed for human ad- 
mission, they pose a plethora of challenges for automatic 
processing: Even web forms within a single domain denote 
search criteria differently, e.g., "address", "city", "town", 
and "neighborhood" all refer to locations, while other terms 
denote different criteria ambiguously, e.g., "tenure" might 
refer to the choice either between "freehold" vs. "leasehold" 
or between "buy" vs. "rent". Moreover, web forms present 
their criteria in different manners, e.g., for a choice among 
several options, a form may contain either a drop-down list
or a set of check boxes. Automatically understanding these
variants to pass through forms is needed by a broad range
of applications: crawling and surfacing the deep web [27]
[2Qll8ll, interface and service integration [35], matching
interfaces across domains I7ll32l, classifying the domain of
web databases [4] for web site classification, sampling the
contents of web databases 1211121, ontology enrichment and
knowledge-base construction [25], and question answering for
the deep web [19]. In web engineering, automated form
understanding contributes, e.g., to web accessibility and
usability, web source integration [10], and automated testing
of form-related web applications.

The form understanding problem has attracted a number
of approaches [35][32][10][23][17]; for a recent survey see [18]
[m. These approaches turn observations on common
features of web forms (in general, across domains) into
specifically tailored algorithms and heuristics, but generally suffer
from three major limitations:

(1) Most approaches are domain independent and thus
limited to observations that hold for forms across all
domains. This limitation is acknowledged in [35][23][17], but
addressed only through domain specific training data, if at
all. Our evaluation supports [17] in that a set of generic
design rules underlies all domains, but that specific domains
parameterise or adapt to these rules in ways uncommon to
other domains.

(2) Most approaches are limited in the classes of
features they use in their heuristics and are often based on a single
sophisticated heuristic using one class of features, e.g., only
visual features [10] or textual and field type features in [11].

(3) Heuristics are translated into monolithic algorithms,
limiting maintainability and adaptability. For example, [32]
and [23] encode specific assumptions on the spatial distance
and alignment of fields and labels, while [11] employs hard-coded
token classes for certain concepts such as "min", "from"
vs. "max", "to".

To overcome these limitations, we present OPAL 
(ontology based web pattern analysis with logic), a domain- 
aware form understanding system that combines visual, tex- 
tual, and structural features with a thin layer of domain 
knowledge. The visual, textual, and structural features are 
combined in a domain-independent analysis to produce a 
highly accurate form labeling. However, for most applica- 
tions what is actually needed is a form model consistent 
with a (reference) schema of the forms in the given domain, 
where all the fields are associated with given types. In OPAL, 
the domain schema is not only used to classify the fields and 
segments of the form model, but also to improve the form 
model based on a set of structural constraints that describe 
typical fields and their arrangement in forms of the domain, 
e.g., how price ranges are presented in forms. To ease the 
development of these domain ontologies, OPAL extends Dat- 
alog with templates to enable reuse of common form pat- 
terns in forms, e.g., how ranges (of any type) are presented 
in forms. With this approach, OPAL achieves nearly perfect 
analysis results (> 97% accuracy). 

In contrast to previous approaches, OPAL produces rich
form models, typed to the given domain schema: The models
contain not only type (and individual) constraints for
form fields, but also group those fields into semantic segments,
possibly with inter-field constraints. These rich models ease
the development of applications that interact with these
forms. To demonstrate this, we have developed a light-weight
form integration system on top of OPAL that fully
automatically translates queries to the domain schema into
queries to the concrete forms.



1.1 Contributions 

OPAL's main contributions are:

(1) Multi-scope domain-independent analysis (Section 3)
that combines structural, textual, and visual features
to associate labels with fields into a form labeling using
three sequential "scopes", increasing the size of the
neighbourhood from a subtree to everything visually to the left
and top of a field. (i) At field scope, we exploit the structure
of the page between fields and labels; (ii) at segment scope,
observations on fields in groups of similar fields; and (iii) at
layout scope, the relative position of fields and texts in the
visual rendering of the page. We impose a strict preference
on these scopes to disambiguate competing labelings and to
reduce the number of fields considered in later scopes.

(2) Domain awareness. (Section 4) OPAL is domain-aware
while being as domain-independent as possible without
sacrificing accuracy. This is based on the observation
that generic rules contribute significantly to form under- 
standing, but nearly perfect accuracy is only achievable 
through an additional layer of domain knowledge. To this 
end, we add an optional, domain-dependent classification 
and form model repair stage after the domain-independent 
analysis. Driven by a domain schema, OPAL classifies form 
fields based on textual annotations of their labels and val- 
ues assigned in the domain-independent form labeling, as 
well as the structure of that form labeling. This classifica- 
tion is often imperfect due to missing or misunderstood la- 
bels. OPAL addresses this in a repair step, where structural 
constraints are used to disambiguate and complete the clas- 
sification and reshape the form segmentation. 

(3) Template Language OPAL-TL. (Section 4) To specify
a domain schema, we introduce OPAL-TL. It extends Datalog
to express common patterns as parameterizable templates,
e.g., describing a group consisting of a minimum and
maximum field for some domain type. Together with some
convenience features for querying the field labeling and its
annotations, OPAL-TL allows for very compact, declarative
specification of domain schemata. We also provide a template
library of common phenomena, such that the adaptation
to new domains often requires only instantiating these templates
with domain specific types. OPAL-TL preserves the
complexity of Datalog.

(4) Methodology for Deriving Domain Schemata. (Section 4.4)
To ease the derivation of an OPAL domain schema,
we present a simple, step-by-step methodology for deriving
such a schema from a standard domain ontology. It is
based on the observation that often the types of the properties
(such as price or mileage of a car) in the domain ontology
determine the configuration of form fields for that type.

Fig. 1: Colin Mason with OPAL (see Figure 2 for segment scope); panel (c): Interpretation

(5) Light-weight Form Integration. (Section 5) To
demonstrate the value of OPAL's rich form models, we implement
a form integration system on top of OPAL that automatically
translates a master query to hundreds of concrete
forms. As shown in the evaluation, even with rather simple 
translation rules, we achieve accurate form filling. 

(6) Extensive Evaluation. (Section 6) In an evaluation
on over 700 forms of four different datasets, we show that
OPAL achieves highly accurate (> 95%) form labelings and,
with a suitable domain schema, near perfect accuracy in
form classification (> 97%). To compare with existing
approaches (which only perform form labeling), we show that
OPAL's domain-independent analysis achieves 94-100%
accuracy on the ICQ benchmark and 92-97% on TEL-8.
Thus, even without domain knowledge OPAL outperforms
existing approaches by at least 5%. We also show that the
form integration system developed on top of OPAL is able to
fill forms correctly in nearly all cases (> 93%).

We believe that OPAL offers a comprehensive solution 
to form understanding for most applications, but also discuss,
in Section 8, the two major remaining challenges for
OPAL (and form understanding, in general): highly scripted, 
interactive forms, increasingly also using customised form 
widgets, as well as richer integrity constraints and access re- 
strictions, in particular for applications that aim to extract all 
of the data behind a form. 



This paper is based on [12], but has been significantly
extended in every part, in particular in the following three
aspects: First, OPAL-TL is only sketched in [12]. Section 4
is the first formal definition of OPAL-TL, including a full
rewriting semantics. It has also been extended significantly,
most importantly in the supported template features (predicate
variables and template groups). Second, we have added
a more detailed description of an OPAL domain schema and
form model to better illustrate how OPAL operates and what
the output of form understanding looks like. Finally, we have
implemented a full, though light-weight, form integration
and filling system on top of OPAL (Section 5) to demonstrate
the value of OPAL's rich models. We have also significantly
extended the evaluation to show the results of the form
integration, as well as to discuss where and why a small portion
of forms still pose a challenge to OPAL.



1.2 OPAL: a Walkthrough

We present the OPAL approach to form understanding using
the form from the UK real estate agency Colin Mason
(cmea.co.uk/properties.asp). Figure 1a shows the web
page with its simplified CSS box model. The page contains
two forms (center and left): one for detailed search and the
other for quick search. OPAL is able to identify, separate,
label, and classify both forms correctly, yielding two
(real-estate) form models. The following discussion focuses on
the search form in the center of Figure 1a, in which each of
the components (1)-(10), each of the fields (3)-(7), and the
two columns of checkboxes in (2) are enclosed in a table,
tr, or td element. Labels for each of the components such
as "Bedrooms:" appear in separate tr's.

Fig. 2: Colin Mason segment scope; panels (b) Labeling 2a: Segments and (c) Labeling 2b: Fields by Segment

OPAL's form understanding operates in two parts: form
labeling and form interpretation. In the form labeling phase
fields and groups of fields (called segments) are assigned 
text labels. In the form interpretation phase those text la- 
bels are used to classify the fields and segments on the 
page, eventually verifying and repairing the label assign- 
ment and producing a form model in line with the given 
domain schema. Form labeling itself is split into field, seg- 
ment, and layout scope, each assigning successively labels 
to more fields and segments of a form. 

Field scope. (Section 3.1) OPAL starts by analysing individual
fields, assigning labels in two ways: First, we add labels
that explicitly reference the field (using the for attribute).
Second, we add labels where the common ancestor with a
field has no other fields as descendants. In our example from
Figure 1a no explicit references occur, but the second approach
correctly labels all fields except the checkboxes in
(2). In Figure 1b we show this initial form labeling using
the same color for fields and their labels.



Segment scope. (Section 3.2) In segment scope, we increase
the scope of the analysis from form fields to groups of
similar fields (called segments). OPAL constructs these segments
from the HTML structure, but eliminates segments
that likely have no semantic relevance and are only introduced,
e.g., for formatting reasons. This elimination is
primarily based on semantic similarity between contained
fields approximated via semantic attributes such as class
and visual similarity. In our example, components (2)-(7)
become segments, with (2) further divided into two segments,
one for each of the vertical checkbox groups, as shown in
Figure 2a. This rough, approximate segmentation may later
be corrected in the form interpretation.



For each segment as a whole, OPAL associates text nodes
to create segment labels. Segment labels can be useful to repair
the form model and to classify fields that have no labels
otherwise. In this example, OPAL assigns the text in bold
face appearing atop each segment as the label, e.g., "Price:"
becomes the label for (4), see Figure 2b. Furthermore, within
each segment, OPAL identifies repeated groups of interleaving
fields and texts. In the example, each check box in (2)
is labeled with the text appearing after it, as shown in
Figure 2c.
Layout scope. (Section 3.3) In the layout scope, OPAL
further enlarges the scope of the analysis to all fields visually 
to the left and above a field. The primary challenge in this 
scope is "overshadowing", i.e., if other fields appear in the 
quadrants to the left and above a field. In this example the 
layout scope is not needed. 

The result of the layout scope is the form labeling. Note
that the form labeling is entirely domain independent.

Domain scope. If a form model is required, the final
step in OPAL produces a form model that is consistent with a
given domain schema. How to derive such a domain schema
and the necessary annotators is discussed in Section 4.4. This
step uses domain knowledge to classify and repair the labeling
and segmentation from the form labeling. In the classification
step, OPAL annotates fields and segments with types
based on annotations of the text labels. The verification step
repairs and verifies the domain model if needed. For both
steps, OPAL uses constraints specified in OPAL-TL. These
constraints model typical representations of types in a domain.
E.g., the first field in (4) is classified as min_price as
we recognise this segment as an instance of a price range
template. These constraints also disambiguate between multiple,
conflicting annotations, e.g., fields in (6) are annotated
with order_by and price, but the price annotation is disregarded
due to the group label. Even without the group label,
price would be disregarded as the domain schema gives
precedence to order_by over price due to the observation that
if they occur together, the field is likely about "order by
price" and not about actual prices. Finally, a single repair is
performed in this case: We collapse the two checkbox segments
in (2) as they are the only children of their parent segment
and both of the same type. Figure 1c shows the final
field classification as produced by OPAL.

Form integration and filling. Using the form interpre- 
tation constructed in the preceding stages, OPAL is able to 
map a master query formulated on the domain schema into 
both of the concrete forms on this page (see Figure 1a). For
location, the values are typed in directly. For price, the range
in the master query can also be directly entered, as the concrete
forms use text inputs for prices and OPAL's form interpretation
identifies the min and max price fields successfully.
For the bedroom number, the value from the master query 
is compared with the members of the check box list and the 
most similar is selected. 



2 Approach 

OPAL constructs a model of a form consistent with a domain 
schema. A domain schema describes how forms in a given 
conceptual domain, such as the UK real estate domain, are 
structured. OPAL divides this problem ("form understanding")
into form labeling and form interpretation. The form
labeling identifies forms and their fields, arranges the fields
into a tree, and labels the found fields, segments, and forms
with text nodes from the page. The form interpretation aligns
a form labeling with the given domain schema and thereby
classifies the form fields based on their labels.



2.1 Problem Definition 

Form Labeling. A web page is a DOM tree $P =
((U)_{U \in \mathrm{Unary}}, R_{\text{child}}, R_{\text{next-sibl}}, R_{\text{attribute}})$, where $(U)_{U \in \mathrm{Unary}}$ are
unary type and label relations, $R_{\text{child}}$ is the parent-child,
$R_{\text{next-sibl}}$ the direct next sibling, and $R_{\text{attribute}}$ the attribute
relation. Further XPath relations (such as descendant) are derived
from these basic relations as usual [6]. $\mathrm{Unary}$ contains relations
for types as in XPath (element, text, attribute, etc.) and three
kinds of label relations, namely $\mathrm{tag}^{t}$ for tags of elements and
attributes, $\mathrm{text}^{t}$ for text nodes containing string $t$, and $\mathrm{box}^{b}$
for elements with bounding box $b$ in a canonical rendering
of the page. For consistency with elements, we represent the
value of an attribute as text child node of the attribute.

Definition 1 A form labeling of a web page P is a tree F
with functions $\mathfrak{Re}$ (representative) and $\mathfrak{La}$ (label), such that
$\mathfrak{Re}$ maps the nodes of F into P. Leaves in F are mapped to
form fields and inner nodes to form segments, that is, DOM
elements grouping a set of fields. Each node n in F is also
mapped to a set $\mathfrak{La}(n)$ of text nodes, the labels of n.

A node can be labeled with no, one, or many labels via
$\mathfrak{La}$. The form labeling contains a representative (via $\mathfrak{Re}$)
for each form. A representative contains all fields (and segments)
of that form. This allows OPAL to distinguish multiple
forms on a single page, even if no form element is
present or multiple forms occur in a single form element.
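
To make Definition 1 concrete, the following minimal Python sketch models a form labeling as a tree whose nodes carry a representative and a set of labels. It is an illustration only, not OPAL's implementation: the class name `LabelingNode`, its attributes, and the use of plain strings in place of DOM nodes are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class LabelingNode:
    """One node of a form labeling F (hypothetical structure)."""
    representative: str                                  # Re(n): stand-in for a DOM node of P
    labels: Set[str] = field(default_factory=set)        # La(n): label text nodes
    children: List["LabelingNode"] = field(default_factory=list)

    def is_field(self) -> bool:
        # Leaves of F are mapped to form fields, inner nodes to segments.
        return not self.children

    def fields(self) -> List["LabelingNode"]:
        """All leaves (form fields) below this node, in document order."""
        if self.is_field():
            return [self]
        return [leaf for child in self.children for leaf in child.fields()]


# Toy labeling: a price segment with two labeled fields and one unlabeled field.
min_price = LabelingNode("input#min", {"from"})
max_price = LabelingNode("input#max", {"to"})
price = LabelingNode("tr#price", {"Price:"}, [min_price, max_price])
keyword = LabelingNode("input#q")          # a node may carry no label at all
form = LabelingNode("table#search", set(), [price, keyword])
```

The root node plays the role of the form representative: every field and segment of the form is reachable from it, which is what lets several forms on one page be kept apart.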

Definition 2 Given a web page P, the form labeling problem
(or schema-less form understanding problem) asks for
a form labeling F where for each form f in P

(1) there is a node $r \in F$ such that $\mathfrak{Re}(r)$ is a suitable
representative of f,

(2) for each field e in f, there exists a leaf node $n_e \in F$ such
that $n_e$ is a descendant of r and $\mathfrak{Re}(n_e) = e$, where $\mathfrak{La}(n_e)$
is a suitable label set for e, and

(3) for each inner node s in F (form segment), $\mathfrak{La}(s)$ is a
suitable set of labels for the set of fields contained in s.

The suitability of a form representative $\mathfrak{Re}(r)$ and a label
set $\mathfrak{La}(n_e)$ is not defined formally, but needs to be evaluated
by human annotators (whom this, after all, aims to simulate).
Our evaluation (Section 6) shows that OPAL produces form
labelings F that match the gold standard in nearly all cases
(> 95% without using any domain knowledge).

We call a form labeling complete for a web page, if, for
all e, $\mathfrak{La}(n_e)$ contains all text nodes suitable as labels for e.
Finding such a form labeling is correspondingly called the 
complete form labeling problem. 
Fig. 3: OPAL Overview



Form Interpretation. To define the form interpretation problem,
we formalize the notion of domain schema and introduce
a form model as a form labeling extended with type
information consistent with a given domain schema. First, 
we define the part of a domain schema that provides the 
necessary knowledge to interpret text nodes ("annotation 
schema"): 

Definition 3 An annotation schema $\Lambda =
(\mathcal{A}, \sqsubseteq, \prec, (\mathrm{isLabel}_a, \mathrm{isValue}_a : a \in \mathcal{A}))$ defines a set $\mathcal{A}$ of
annotation types, a transitive, reflexive subclass relation $\sqsubseteq$, a
transitive, irreflexive, antisymmetric precedence relation $\prec$,
and two characteristic functions $\mathrm{isLabel}_a$ and $\mathrm{isValue}_a$ on text
nodes for each $a \in \mathcal{A}$.

For each annotation type $a \in \mathcal{A}$, we distinguish proper
labels and values, with $\mathrm{isLabel}_a$ and $\mathrm{isValue}_a$ as corresponding
characteristic functions. Proper labels are text nodes, such as
"Price:", describing the field type; values, such as "more
than £500", contain possible values of the field. Hence
$\mathrm{isLabel}_{price}$("Price:") and $\mathrm{isValue}_{price}$("more than £500") hold.

The $\sqsubseteq$ relation holds for subtypes, e.g., postcode $\sqsubseteq$
location, and the $\prec$ relation defines precedence on annotation
types used to disambiguate competing annotations. For example,
an unlabeled select box with options "Choose sorting
order", "By price", and "By postcode" may be annotated
with order-by, price, and postcode. If order-by $\prec$ price and
order-by $\prec$ postcode, we pick order-by.
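
The disambiguation by precedence can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function name `pick_by_precedence` is hypothetical, annotation types are plain strings, and the precedence relation is given as an explicit, already transitively closed set of pairs `(a, b)` meaning that `a` takes precedence over `b`.

```python
from typing import Set, Tuple


def pick_by_precedence(candidates: Set[str],
                       precedes: Set[Tuple[str, str]]) -> Set[str]:
    """Keep the candidate annotation types that no other candidate overrides.

    `precedes` holds pairs (a, b): a takes precedence over b. The relation is
    assumed to be transitively closed, as required of a precedence relation.
    """
    return {
        b for b in candidates
        if not any((a, b) in precedes for a in candidates if a != b)
    }


# Precedence from the select-box example: order_by wins over price and postcode.
precedes = {("order_by", "price"), ("order_by", "postcode")}

select_box = pick_by_precedence({"order_by", "price", "postcode"}, precedes)
no_conflict = pick_by_precedence({"price", "postcode"}, precedes)
```

When no candidate overrides another, as in the second call, all candidates survive, so precedence alone does not force a single annotation.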

Definition 4 A domain schema $\Sigma = (\Lambda, T, \multimap, C_T, C_\Lambda)$
defines an annotation schema $\Lambda$, a set of domain types T with
a (transitive, reflexive) part-of relation $\multimap$, and $C_T$ and $C_\Lambda$,
which map domain types to structural and classification
constraints, respectively.

For example, $C_\Lambda$(price) requires an annotation price and
prohibits any annotation of a type with precedence over price
(such as order-by above). The set of structural constraints
$C_T$(price-range) for a price-range segment requires a min-price
and max-price field or a price-range field. We write
$S \models C$, if a constraint set C is satisfied by a set S of
annotation or domain types. The empty constraint set is always
satisfied. $\multimap$ plays an important role in the definition of the
constraints, as it prescribes the structure of the types in the
domain. For details on constraints and how to define them,
see Section 4.

Formally, a form interpretation $(F, \tau)$ is a form labeling
F with a partial type-of relation $\tau$, relating nodes in F
with the types T of $\Sigma$. Given a node n in F, we denote with
$\mathcal{A}(n) = \{a \in \mathcal{A} : \exists l \in \mathfrak{La}(n) \text{ with } \mathrm{isValue}_a(l) \text{ or } \mathrm{isLabel}_a(l)\}$
the set of annotation types associated with n via its labels,
and with $\text{child-}T(n) = \bigcup_{(n,n') \in F} \tau(n')$ the set of domain
types of the children of n.

Definition 5 A form interpretation $(F, \tau)$ is a form model
for $\Sigma$, iff $\mathcal{A}(n) \models C_\Lambda(t)$ and $\text{child-}T(n) \models C_T(t)$ for all $n \in F$,
$t \in \tau(n)$.
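
The check behind Definition 5 can be sketched in a few lines of Python. This is a rough illustration under strong assumptions, not OPAL-TL: constraints are modelled as plain Python predicates over sets of type names, and the names `Node`, `satisfied`, `is_form_model`, `c_annot`, and `c_struct` are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple, Set

Constraint = Callable[[Set[str]], bool]


class Node(NamedTuple):
    annotations: Set[str]   # A(n): annotation types found on the labels of n
    child_types: Set[str]   # child-T(n): domain types of the children of n
    types: Set[str]         # tau(n): domain types assigned to n


def satisfied(s: Set[str], constraints: List[Constraint]) -> bool:
    # all() over an empty list is True: the empty constraint set always holds.
    return all(c(s) for c in constraints)


def is_form_model(nodes: List[Node],
                  c_annot: Dict[str, List[Constraint]],
                  c_struct: Dict[str, List[Constraint]]) -> bool:
    """Every node must satisfy both constraint sets of each of its types."""
    return all(
        satisfied(n.annotations, c_annot.get(t, []))
        and satisfied(n.child_types, c_struct.get(t, []))
        for n in nodes
        for t in n.types
    )


# Classification constraints: price needs a price annotation and no order_by.
c_annot = {"price": [lambda a: "price" in a, lambda a: "order_by" not in a]}
# Structural constraints: a price range needs min and max, or a range field.
c_struct = {"price_range": [
    lambda c: {"min_price", "max_price"} <= c or "price_range" in c
]}

good = [Node({"price"}, set(), {"price"}),
        Node(set(), {"min_price", "max_price"}, {"price_range"})]
bad = [Node({"price", "order_by"}, set(), {"price"})]
```

Here `good` is accepted, while `bad` is rejected because a field typed as price also carries an annotation that takes precedence over price.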

Definition 6 Given a domain schema $\Sigma$ and a form labeling
F, the form interpretation problem asks for a form model
$(F', \tau)$ for $\Sigma$ such that F' differs from F only in inner nodes.

Thus, form representatives, fields, and labels are shared between
F and F', but the form segments may be rearranged
to conform with the structural constraints of $\Sigma$.

Definition 7 Given a domain schema $\Sigma$ and a web page P,
the (schema-based) form understanding problem asks for
a form model $(F, \tau)$ of P under $\Sigma$, such that F is a solution
of the complete form labeling problem for P.

Form Integration and Filling. In web interface integration 
a query against a global domain schema is translated and 
executed on concrete forms. The returned data is translated 
into the domain schema and returned. We focus here on the 
first part of the integration problem, the query translation
or form integration problem, and more specifically on its 
optimistic variant: 

Let $\Sigma$ be a domain schema. Then a query Q on $\Sigma$ is
a set of unary constraints on T, the domain types in $\Sigma$.
We consider three types of constraints: (1) Equality constraints
such as POSTCODE = OX1; (2) range constraints
such as PRICE $\in$ [700, 1250]; (3) inclusion constraints such
as COLOUR $\in$ {red, green, black}.

Definition 8 Given a domain schema $\Sigma$, a query Q on $\Sigma$,
and a concrete form F, the form integration problem is the
problem to translate Q into a (single) query Q' on F such
that Q' returns all results that match Q and can be retrieved
by F and that there is no other query on F with that property
that returns fewer results.

Note that we do not require that Q' returns only results
that match Q, but that its result set is minimal among all
queries on F that return all matches for Q that can be retrieved
by F. This is necessary, as there may be no query on
F that is able to exactly express Q.

2.2 OPAL Architecture 

OPAL is divided into three parts. Of those, two form OPAL's
form understanding: a domain-independent part to address 
the form labeling problem and a domain-dependent part for 
form interpretation according to a domain schema. The re- 
maining part is devoted to form integration and translates 
queries against a domain schema into queries on concrete 
forms. 

OPAL produces form labelings in a novel multi-scope approach
that incrementally constructs a form labeling combining
textual, structural, and visual features (Figure 3).
Each of the three labeling scopes considers features not
considered in prior scopes:

(1) In field scope, we consider only fields and their immediate
neighbourhood and thus use only the DOM tree as
input.

(2) In segment scope, we detect and arrange form seg- 
ments into a segment tree to interleave the contained text 
nodes and fields. 

(3) In layout scope, we broaden the potential labels of a 
field by searching in the layout tree, i.e., the visual rendering 
of the page, and assign text nodes to fields, given a strong 
visual relation. 

Each scope builds on the partial form labeling of the pre- 
vious scope and uses the information from the additional 
input to find labels for previously unlabeled fields (or seg- 
ments). Only the segment scope adds nodes, namely form 
segments, whereas field and layout scope only add labels. 

Finally, in the (4) form interpretation (Section 4) we
turn the form labeling produced by the first three scopes
into a form model consistent with a given domain schema.
(i) The labeling model is extended with (domain-specific)
annotations on the textual content of proper labels and values.
(ii) Fields and segments of the form labeling are classified
according to classification constraints in the domain schema.
(iii) Finally, violations of structural schema constraints are
repaired in a top-down fashion.

Algorithm 1: FieldScopeLabelling(DOM P)

1 foreach field f in P do
2     n ← f;
3     while n has a parent do
4         if n is already coloured then colour n red; break;
5         colour n orange;
6         n ← parent of n;
7 F ← empty form labeling;
8 foreach field f in P do
9     n ← new leaf node in F;
10    Re(n) ← f;
11    if ∃l ∈ P with for attribute referencing f then
12        assign all text node descendants of l as labels to n;
13    p ← parent of f;
14    while p not coloured red do
15        l ← p; p ← parent of l;
16    assign all text node descendants of l as labels to n;

Types and constraints of the domain schema are speci- 
fied using OPAL-TL, an extension of Datalog that combines 
easy querying of the form labeling and of annotations with a 
rich template system. Datalog rules already ease the reuse of 
common types and their constraints, but the template exten- 
sion enables the formulation of generic templates for such 
types and constraints that are instantiated for concrete types 
of a domain. An example of a type template is the range tem- 
plate, that describes typical ways for specifying range values 
in forms. In the real estate domain it is instantiated, e.g., for 
price and various room ranges. In the used car domain, we 
also find ranges for engine size, mileage, etc. Thus, creating
a domain schema is in many cases as easy as importing
common types and instantiating templates, see Section 4.

The form understanding part of OPAL is complemented 
with a form integration part, where we translate a given 
query on the domain schema into queries on concrete forms. 
To do so, we construct an OPAL form model as above and 
then use that form model to map the constraints of the given 
query to fields on the concrete form. The form is then filled 
according to the constraints. Where a constraint cannot be
mapped precisely, we use standard similarity techniques to
find the closest, inclusive option (in case of numerical types)
or just the closest option (in case of categorical types), see
Section 5.
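
As a rough illustration of this mapping, the following Python sketch picks the closest inclusive option for a numerical bound and the most similar option for a categorical value. It is a sketch under assumptions, not OPAL's translation rules: the function names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library merely stands in for the unspecified "standard similarity techniques".

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def closest_inclusive_min(bound: float, options: Sequence[float]) -> Optional[float]:
    """Largest offered lower bound that still admits every value >= bound."""
    covering = [o for o in options if o <= bound]
    return max(covering) if covering else None   # None: leave the field unset


def closest_inclusive_max(bound: float, options: Sequence[float]) -> Optional[float]:
    """Smallest offered upper bound that still admits every value <= bound."""
    covering = [o for o in options if o >= bound]
    return min(covering) if covering else None   # None: leave the field unset


def closest_category(value: str, options: Sequence[str]) -> str:
    """Option whose text is most similar to the queried value (case-insensitive)."""
    return max(
        options,
        key=lambda o: SequenceMatcher(None, value.lower(), o.lower()).ratio(),
    )


# Range constraint PRICE in [700, 1250] against a form with fixed price steps.
price_steps = [0, 500, 1000, 1500]
lower = closest_inclusive_min(700, price_steps)    # 500
upper = closest_inclusive_max(1250, price_steps)   # 1500
colour = closest_category("red", ["Green", "Red", "Black"])
```

Choosing 500 and 1500 widens the queried range rather than narrowing it, so no matching result is lost, in line with the minimality requirement of Definition 8.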
3 Form Labeling

In OPAL, form labeling is split into three scopes. Each scope 
is focused on a particular class of features (e.g., visual,
structural, textual). The form labeling scopes, field, segment, and
layout scope, use domain-independent labeling techniques
to associate form fields or segments with textual labels,
building a form labeling F. If a domain schema is available,
the form labeling is extended to a form model in the
domain-dependent analysis (Section 4).

The form labeling F is constructed bottom-up, applying 
each scope's technique in sequence to yet unlabelled fields. 
Whenever a field is labelled at a certain scope level, further 
scopes do not consider this field again. This precedence or- 
der reflects higher confidence in earlier scopes and addresses 
competing label assignments. 
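
The precedence among scopes can be pictured with a small Python sketch. It only illustrates the control flow described above; the names `label_in_scopes` and `lookup_scope` are hypothetical, fields are plain strings, and each scope is replaced by a fixed lookup table instead of a real analysis.

```python
from typing import Callable, Dict, List, Set

# A scope maps the still-unlabeled fields to the labels it can find for them.
Scope = Callable[[List[str]], Dict[str, Set[str]]]


def label_in_scopes(fields: List[str], scopes: List[Scope]) -> Dict[str, Set[str]]:
    """Apply the scopes in order; a field labeled once is never revisited."""
    labeling: Dict[str, Set[str]] = {}
    for scope in scopes:                      # earlier scopes take precedence
        unlabeled = [f for f in fields if f not in labeling]
        for f, labels in scope(unlabeled).items():
            if f in unlabeled and labels:
                labeling[f] = set(labels)
    return labeling


def lookup_scope(table: Dict[str, Set[str]]) -> Scope:
    """Toy scope backed by a fixed table (stands in for a real analysis)."""
    return lambda todo: {f: table[f] for f in todo if f in table}


field_scope = lookup_scope({"min_price": {"from"}})
segment_scope = lookup_scope({"studio": {"Studio"}})
# The layout scope would also label min_price, but it never gets to see it.
layout_scope = lookup_scope({"min_price": {"Price:"}, "keyword": {"Search"}})

result = label_in_scopes(
    ["min_price", "studio", "keyword"],
    [field_scope, segment_scope, layout_scope],
)
```

In this toy run the competing layout-scope label for `min_price` is discarded, because the field was already labeled at field scope.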

3.1 Field Scope 

Based on the DOM tree of the input page, the field scope 
assigns text nodes in a unique structural relation to individual
fields as labels to these fields (see Algorithm 1). It relies
on the observation that, if a text node shares a sub-tree of
the DOM with a single field only, then that text node is most
likely related to that field. This simple observation produces
a significant portion of form labels, as shown in Section 6,
and is designed to produce nearly no false positives, as also
verified in Section 6.

Specifically, Algorithm 1 (1) colours (lines 1-6) all 
nodes in P that are ancestors of a field and do not have other 
form fields as descendants in orange. The least ancestor that 
violates that condition is coloured red. (2) It identifies 
(lines 7-10) all form fields and initialises the form labeling 
F with one leaf node for each such field. (3) It considers 
(lines 11-12) explicit HTML label elements with direct 
reference to a form field. (4) It labels (lines 13-16) each 
field f with all text nodes t whose least common ancestor with 
f has no other form field as descendant. This includes all 
text nodes t in the content of f such as its values (in case 
of select, input, or textarea elements), since the least 
common ancestor of t and f is f itself. We find these text 
nodes in linear time due to the tree colouring. 
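
Step (4) can be sketched as follows. This is a simplified illustration rather than Algorithm 1 itself: the `Node` class and the function name are hypothetical, explicit HTML label elements are ignored, and the walk up the ancestors is not tuned to the linear-time bound. A text node labels a field exactly when its lowest ancestor containing any field contains only that one field.

```python
class Node:
    """Minimal DOM stand-in; kind is "elem", "field", or "text"."""

    def __init__(self, name, kind="elem", children=()):
        self.name = name
        self.kind = kind
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def field_scope_labels(root):
    info = {}  # node -> (number of fields in subtree, the field if exactly one)

    def colour(n):
        count = 1 if n.kind == "field" else 0
        only = n if n.kind == "field" else None
        for child in n.children:
            c_count, c_only = colour(child)
            count += c_count
            only = only or c_only
        info[n] = (count, only if count == 1 else None)
        return info[n]

    labels = {}

    def assign(n):
        if n.kind == "text":
            # Climb to the lowest ancestor that has at least one field below it.
            anc = n.parent
            while anc is not None and info[anc][0] == 0:
                anc = anc.parent
            if anc is not None and info[anc][0] == 1:
                labels.setdefault(info[anc][1].name, []).append(n.name)
        for child in n.children:
            assign(child)

    colour(root)
    assign(root)
    return labels


form = Node("form", children=[
    Node("div1", children=[Node("Make", "text"), Node("make", "field")]),
    Node("div2", children=[Node("Price", "text"),
                           Node("min", "field"), Node("max", "field")]),
    Node("colour", "field", [Node("Any", "text")]),
])
labels = field_scope_labels(form)
print(labels)
```

In the example, "Price" stays unassigned because it shares its sub-tree with two fields; it is left for the later scopes.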

3.2 Segment Scope 

At segment scope, the labeling analysis expands from 
individual fields to form segments, i.e., groups of 
consecutive fields with a common parent, forming the segment 
tree (Algorithm 2). These segments are then used to distribute 
text nodes to unlabeled fields in that segment (Algorithm 3). 
At this scope, we approximate form segments through the 
DOM structure and the style of the contained fields. This 
segmentation is later adjusted to yield only form segments 
with clear semantics. It is worth noting that on many forms 
only very few adjustments are necessary, supporting the 
veracity of the approximation of semantic segments through 
structure and style. 

Algorithm 2: SegmentTree(DOM P) 

 1  P' ← P; 
 2  while ∃n ∈ P' : n is not a field ∧ (∄ field d : R_descendant(d, n) ∈ P') do 
 3      delete n and all incident edges from P'; 
 4  while ∃n ∈ P' : |{c ∈ P' : R_child(c, n) ∈ P'}| = 1 do 
 5      delete n from P' and move its child to the parent of n; 
 6  foreach inner node n in P' in bottom-up order do 
 7      C ← {f : R_child(f, n) ∈ P' ∧ f is a field}; 
 8      C ← C ∪ {Representative(n') : R_child(n', n) ∈ P'}; 
 9      choose r ∈ C arbitrarily; 
10      if ∀r' ∈ C : r style-equivalent to r' then 
11          Representative(n) ← r; 
12          delete all non-field children of n and move their 
            children to n; 
13      else Representative(n) ← ⊥; 
14  return P'; 

[Figure: DOM tree (left) and segment tree (right)] 

Fig. 4: Example DOM and Segment Tree 

Segmentation tree. We observe that the DOM is often a fair, 
but noisy approximation of the semantic form structure, as 
it reflects the way the form author grouped fields into seg- 
ments. Therefore, we start from the DOM structure to find 
the form segments, but we eliminate all nodes that can be 
safely identified as superfluous: nodes without field descen- 
dants, nodes with only one child, and nodes n where all 
fields in n are style-equivalent to the fields in the siblings 
of n. Two fields are style-equivalent if they carry the same 
class attribute (indicating a formatting or semantic class) or 
the same type attribute and CSS style information. 

If all field descendants of the parent of an inner node n 
are style-equivalent, then n should be eliminated from the 
segment tree, as it artificially breaks up the sequence of 
style-equivalent fields and is thus equivalence breaking. 

Definition 9 The segment tree P' of a form page P is the 
maximal DOM tree included in P (i.e., obtained by collaps- 
ing nodes) such that the leaves of P' are all fields and, for all 
its inner nodes n, 

(1) $|\{c \in P' : R_{child}(c, n)\}| > 1$, 

(2) n is not equivalence breaking. 

As an example, consider the DOM tree on the left of Figure 4, 
where diamonds represent fields and style-equivalent 
fields carry the same colour. On the right hand side, we show 
OPAL's segment tree for that DOM. Nodes 1 and 3 from the 
original DOM are eliminated as they have only one child, 
and node 2 as it is equivalence breaking. Nodes 4 and 5 are 
retained due to the red field. 
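
The construction of the segment tree can be sketched recursively. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: style-equivalence is reduced to equality of a single style string per field, the `Node` class is hypothetical, and pruning, single-child collapsing, and style-based merging are interleaved in one bottom-up pass.

```python
BOTTOM = object()  # representative that is style-equivalent to nothing


class Node:
    """Minimal DOM stand-in; a node is a field iff it carries a style."""

    def __init__(self, name, style=None, children=()):
        self.name = name
        self.style = style
        self.children = list(children)

    @property
    def is_field(self):
        return self.style is not None


def segment_tree(node):
    """Return (segment subtree, representative), or (None, None) if field-less."""
    if node.is_field:
        return Node(node.name, node.style), node.style
    kids, reps = [], []
    for child in node.children:
        sub, rep = segment_tree(child)
        if sub is not None:          # drop subtrees without field descendants
            kids.append(sub)
            reps.append(rep)
    if not kids:
        return None, None
    if len(kids) == 1:               # collapse nodes with a single child
        return kids[0], reps[0]
    if BOTTOM not in reps and len(set(reps)) == 1:
        # All fields below are style-equivalent: flatten inner children.
        flat = []
        for k in kids:
            flat += [k] if k.is_field else k.children
        return Node(node.name, children=flat), reps[0]
    return Node(node.name, children=kids), BOTTOM


def show(n):
    return n.name if n.is_field else (n.name, [show(c) for c in n.children])


dom = Node("form", children=[
    Node("wrapper", children=[
        Node("divA", children=[
            Node("span1", children=[Node("f1", "text")]),
            Node("span2", children=[Node("f2", "text")]),
        ]),
    ]),
    Node("divB", children=[Node("f3", "text"), Node("f4", "checkbox")]),
    Node("p"),
])
tree, _ = segment_tree(dom)
print(show(tree))
```

Here `wrapper`, `span1`, and `span2` disappear because they have a single child, `p` because it has no field descendant, while `divA` and `divB` survive because the checkbox prevents the root from being style-uniform.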

Theorem 1 The segment tree P' of a web page P can be 
computed in $O(n \times d)$, with n the size and d the depth 
of P. 

Proof Algorithm 2 computes the segment tree P' for any 
DOM tree P. Its leaves are fields (as any non-field leaves are 
eliminated in lines 2-3) and any inner node must have more 
than 1 child (due to lines 4-5), a field descendant (due to 
lines 2-3), and not be equivalence breaking (due to lines 
6-13). In lines 6-13, we compute a Representative, bearing the 
style prevalent among the inner node's fields, for each inner 
node in a bottom-up fashion: If all field children (line 
7) and the representatives of all inner children (line 8) are 
style-equivalent (lines 9-10), we choose an arbitrary 
representative and collapse all inner children of that node. 
Note that it suffices to compare any of the representatives 
with the fields in C as style-equivalence is transitive. 
Otherwise, we assign $\bot$ as representative, which is 
style-equivalent neither to any node nor to itself. Thus it 
prevents this node (and its ancestors) from ever being 
collapsed. By construction, these nodes respect (1) and (2) 
and this property is retained in all later steps, as their 
subtrees are never touched again. 

P' is maximal: Any tree P'' that includes P' but is included 
in P must contain at least one node from P that has 
been deleted by one of the above conditions. Such a node, 
however, violates at least one of the conditions for a segment 
tree and thus P'' is not a segment tree. This holds because the 
order of the node deletions does not affect the nodes deleted. 

Algorithm 2 runs in $O(n \times d)$: Lines 2-3 are in $O(n)$. 
Lines 4-5 and lines 6-13 are both in $O(n \times d)$ as they 
are dominated by the collapsing of the nodes. At most, we 
collapse $d - 2$ inner nodes and move $O(n)$ leaves $d - 2$ 
times. 



Segment Labeling. We extend the existing form labeling 
F of the field scope with form segments according to the 
structure of the segment tree and distribute labels in 
regular groups, see Algorithm 3. First (lines 2-5), we create 
a form segment node s in the form labeling for each inner 
node n_s in the segment tree and choose n_s as representative 
for s (Re(s) = n_s). For each segment with regular 
interleaving of text nodes with field or segment nodes, we use 
those text nodes as labels for these nodes, preserving any 
already assigned labels and fields (from field scope). In 
detail, we iterate over all descendants c of each segment in 
document order, skipping any nodes that are descendants of 
another segment or field itself contained in n (line 13). In 
the iteration, we collect all field or segment nodes in Nodes, 
and all sets of text nodes between field or segment nodes in 
Labels, except those already assigned in field scope (line 
14), as we assume that these are outliers in the regular 
structure of the segment. We assign the i-th text node group 
to the i-th field, if the two lists have the same size 
(possibly using the first group as labels of the segment, 
lines 17-19). 

Algorithm 3: SegmentLabeling(DOM P, Form Labeling F) 

 1  S ← SegmentTree(P); 
 2  foreach inner node s in S in bottom-up order do 
 3      create a new segment n_s in F; 
 4      Re(n_s) ← s; 
 5      create an edge (n_s, c_s) in F for every Re(c_s) child of s; 
 6  foreach segment n in F do 
 7      Nodes, Labels ← new List(); textGrp ← ∅; 
 8      foreach c : R_descendant(c, Re(n)) ∈ P in document order do 
 9          if ∃f ∈ F : Re(f) = c ∧ La(f) = ∅ then 
10              if textGrp ≠ ∅ then Labels.add(textGrp); 
11              textGrp ← ∅; 
12              Nodes.add(c); 
13              skip all descendants of c in the iteration; 
14          else if c is a text node ∧ ∄d ∈ F : c ∈ La(d) then 
15              textGrp ← textGrp ∪ {c}; 
16      if textGrp ≠ ∅ then Labels.add(textGrp); textGrp ← ∅; 
17      if Labels.size() = Nodes.size() + 1 then 
18          add Labels[0] to La(n); 
19          delete Labels[0] from Labels; 
20      if Labels.size() = Nodes.size() then 
21          foreach i do add Labels[i] to La(Nodes[i]); 

Fig. 5: Example for Segment Scope Labeling 

Figure 5 illustrates the segment scope labeling with 
triangles standing for text nodes, diamonds for fields, black 
circles for segments, and white circles for DOM nodes not 
in the segment tree. The numbers indicate which text nodes 
are assigned as labels to which segments or fields. E.g., for 
the left hand segment, we observe a regular structure of (text 
node+, field)+ and thus we assign the i-th group of text 
nodes to the i-th field. For the right hand segment (4), we 
find a subsegment (5) and field 8 that is already labeled with 
text node 8 in the field scope. Thus 8 is ignored and only one 
text node remains directly in 4, which becomes the segment 
label. In 5, we find one more text node group than fields 
and thus consider the first text node group as a segment 
label. The remaining nodes have a regular structure (field, 
text node+)+ and get assigned accordingly. 
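
The label distribution within one segment (lines 6-21 of Algorithm 3) can be sketched on a flattened input. The sketch assumes the descendants of the segment have already been reduced to a document-order list of `(kind, name, already_labelled)` triples, with nested segments represented as single nodes; this encoding and the function name are assumptions made for illustration.

```python
def distribute_labels(items):
    """items: document-order list of (kind, name, already_labelled) triples,
    where kind is "node" (field or sub-segment) or "text".
    Returns (segment_label_group_or_None, {node_name: text_group})."""
    nodes, labels, group = [], [], []
    for kind, name, labelled in items:
        if kind == "node" and not labelled:
            if group:
                labels.append(group)
            group = []
            nodes.append(name)
        elif kind == "text" and not labelled:
            group.append(name)
    if group:
        labels.append(group)

    segment_label = None
    if len(labels) == len(nodes) + 1:
        # One surplus group: the first one labels the segment itself.
        segment_label = labels.pop(0)
    assignment = {}
    if len(labels) == len(nodes):
        assignment = dict(zip(nodes, labels))
    return segment_label, assignment


regular = [("text", "Min price", False), ("node", "min", False),
           ("text", "Max price", False), ("node", "max", False)]
with_header = [("text", "Price", False), ("node", "min", False),
               ("text", "from", False), ("node", "max", False),
               ("text", "to", False)]
irregular = [("node", "a", False), ("node", "b", False), ("text", "x", False)]

print(distribute_labels(regular))
print(distribute_labels(with_header))
print(distribute_labels(irregular))
```

The third input has no regular interleaving, so nothing is assigned and the fields are left for the layout scope.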



3.3 Layout Scope 

At layout scope, we further refine the form labeling for each 
form field not yet labelled in field or segment scope, by 
exploring the visible text nodes in the west, north-west, or 
north quadrant, if they are not overshadowed by any other 
field. To avoid false positives, we limit this search to the 
boundaries of the enclosing form. First, OPAL constructs a 
layout tree from the CSS box labels of the DOM nodes: 

Definition 10 The layout tree of a given DOM P is a tuple 
$(N_P, \triangleleft, w, nw, n, ne, e, se, s, sw, aligned)$ 
where $N_P$ is the set of DOM nodes from P, $\triangleleft$, 
w, nw, n, ... the "belongs to" (containment), west, 
north-west, north, ... relations from RCR IS], and 
aligned(x, y) holds if x and y have the same height 
and are horizontally aligned. 

We call w, nw, ... the neighbour relations. The layout tree 
is at most quadratic in size of a given DOM P and can be 
computed in $O(|P|^2)$. For convenience, we write, e.g., 
w-nw-n to denote the union of the relations w, nw, and n. 

In cultures with left-to-right reading direction, we ob- 
serve a strong preference for placing labels in the w-nw-n re- 
gion from a field. However, forms often have many fields 
interspersed with field labels and segment labels. Thus we 
have to carefully consider overshadowing. Intuitively, for a 
field f, a visible text node t is overshadowed by another field 
f' if t is above f' or also visible from, but closer to f'. In 
the particular case of aligned fields, the former would prevent 
any labeling for these fields and thus we relax the condition. 

Definition 11 For a given text node t, a field f' overshadows 
another field f if 

(1) f and f' are unaligned, w-nw-n(f', f), and 
w-nw-n-ne-e(t, f') or 

(2) f and f' are aligned and (i) w(t, f') or (ii) nw-n(t, f') 
and there is a text node t' not overshadowed by another field 
with ne-e(t', f') and w-nw-n(t', f). 

To illustrate this overshadowing, consider the example in 
Figure 6. For field F1, T2 and T4 are overshadowed by F2 and 
T3 by F3; only T1 is not overshadowed, as there is no other 
text node that is west, north-west, or north from F1 and not 
overshadowed by another field. 

The layout scope labeling is then produced as follows: 
For each field f, we collect all text nodes t with 
w-nw-n(t, f) and add them as labels to f if they are not 
overshadowed by another field and not contained in a segment 
that is no ancestor of f. The latter prevents assignment of 
labels from unrelated form segments. 



[Figure: field F1 at the centre of the quadrants NW, N, NE, W, E, SW, S, and SE] 

Fig. 6: Layout Scope Labeling 



4 Form Interpretation 

There is no straightforward relationship between form 
fields for domain concepts, such as location or price, and 
their structure within a form. Even seemingly 
domain-independent concepts, such as price, often exhibit 
domain-specific peculiarities, such as "guide price", 
"current offers in excess", or payment periods in real estate. 
OPAL's domain schemata allow us to cover these specifics. We 
recall from Section 2 that a form model $(F', \tau)$ for a 
schema $\Sigma$ is derived from a form labeling F by extending 
F with types and restructuring its inner nodes to fit the 
structural constraints of $\Sigma$. 

OPAL performs form interpretation of a form labeling F 
in two steps: (1) the classification of nodes in F according to 
the domain types T to obtain a (partial) typing $\tau$. This 
step relies on the annotation schema $\Lambda$ and its typing 
of labels in F; (2) the model repair where the segmentation 
structure derived in the segment scope (Section 3.2) is aligned 
with the structure constraints of $\Sigma$ to complete the 
typing. 

The effort for creating an OPAL domain schema may, at 
first glance, appear considerable. However, not only do 
we provide OPAL-TL (Section 4.1) to ease the specification 
of a domain schema, we also discuss in Section 4.4 how all 
the artefacts needed by OPAL for a new domain can be nearly 
automatically derived from a standard ontology of a domain 
(including concept labels) and a set of entity recognisers (or 
annotators) for instances of the concepts. We illustrate this 
methodology for domain instantiation along the example of 
the used car domain. 



4.1 Schema Design: OPAL-TL 

OPAL provides a template language, OPAL-TL, for easily 
specifying domain schemata reusing common concepts and 
their constraints as well as concept templates. To implement 
a new domain, we only need to provide (1) for each annotation 
type a an annotator implementing $\mathit{isLabel}_a$ and 
$\mathit{isValue}_a$ and (2) an OPAL-TL specification of the 
domain types with their classification and structural 
constraints. The latter can be derived almost mechanically 
from the domain types as discussed in Section 4.4. 

OPAL-TL extends Datalog with template capabilities and 
predefined predicates for convenient querying of annotations 
and DOM nodes. An OPAL-TL program is executed 
against a form labeling F and a DOM P. Relations from 
F and P are mapped in the obvious way to OPAL-TL. We 
only use child (descendant, resp.) for the child (descendant, 
resp.) relation in F. We extend document and sibling 
order from P to F: follows(X, Y) for $X, Y \in F$, if 
$R_{following}(\mathit{Re}(X), \mathit{Re}(Y)) \in P$ and no 
other node in F occurs between X and Y in document order; 
adjacent(X, Y), if 
$R_{next\text{-}sibling}(\mathit{Re}(X), \mathit{Re}(Y)) \in P$ 
or vice versa. Finally, we abbreviate 
text^(9^e(X)) and tag^(9^e(X)) as "/"(X) andr(X). 

Annotation types and their queries. Annotations (instances 
of annotation types) are characterised by an external 
specification of the characteristic functions 
$\mathit{isLabel}_A$ and $\mathit{isValue}_A$ for each 
$A \in \mathcal{A}$. In the current version of OPAL, these 
functions are implemented with simple GATE (gate.ac.uk) 
gazetteers and transducers that are either provided by human 
domain experts or realised by access to external annotators 
and knowledge bases such as DBPedia and Freebase. 
Together they provide annotators for common domain types 
such as price, location, or date. Additional entity 
recognisers or annotators can be added easily, as described in 
Section 1131 

Annotations are used in annotation queries to select 
fields based on annotations on their labels and the labels of 
their segments: 

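Definition 12 For a form labeling F on a DOM P and 
an annotation schema $\Lambda$ with annotation types 
$\mathcal{A}$, an OPAL-TL annotation query is an expression of 
the form X@A{d,p,e,m} where X is a first-order variable, 
$A \in \mathcal{A}$, and d, p, e, and m are annotation 
modifiers. An annotation query X@A$\mu$ with 
$\mu \subseteq \{d, p, e, m\}$ holds for 
$X \in [\![A\mu]\!]$ with 

$$
\begin{aligned}
[\![A\mu]\!] &= \{n \in \mathit{Fields} : M_\mu(A, n) \neq \emptyset\} \setminus \mathit{Block}_\mu(A) \\
\mathit{Fields} &= \{n \in P : \exists \text{ leaf } f \in F : n = \mathit{Re}(f)\} \\
M_\mu(A, n) &= \begin{cases}
  \mathit{Allowed}_\mu(n) \cap \mathit{isLabel}_A & \text{if } p \in \mu \\
  \mathit{Allowed}_\mu(n) \cap (\mathit{isLabel}_A \cup \mathit{isValue}_A) & \text{otherwise}
\end{cases} \\
\mathit{Block}_\mu(A) &= \begin{cases}
  \{n : \exists A' : |M_\mu(A, n)| < |M_\mu(A', n)|\} & \text{if } m \in \mu \\
  \{n : \exists A' \prec A : |M_\mu(A, n)| < |M_\mu(A', n)|\} & \text{if } e \in \mu \\
  \emptyset & \text{otherwise}
\end{cases} \\
\mathit{Allowed}_\mu(n) &= \begin{cases}
  \mathit{La}(n) & \text{if } d \in \mu \\
  \mathit{La}(n) \cup \mathit{La}(\text{parent of } n) & \text{otherwise}
\end{cases}
\end{aligned}
$$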

Intuitively, an annotation query X@A returns all fields 
labeled with a label that is annotated with A. If the modifier 
d (direct) is not present, we also consider the (direct) 
segment parents, otherwise only direct labels are considered. 
If the modifier p (proper) is present, only 
$\mathit{isLabel}_A$ is used, otherwise also 
$\mathit{isValue}_A$. If the modifier e (exclusive) is present, 
a node that fulfils all other conditions is still not returned, 
if there are more labels with annotations of a type that has 
precedence over A. If the modifier m (maximal) is present, 
no other type, regardless of precedence, may have more 
labels with annotations at the node. Since m excludes strictly 
more nodes than e, a query with both m and e returns the 
same nodes as that query without e. 

Fig. 7: Example Form Labeling 

[Figure: panel (a) shows the labels "Price:", "Min:", and "Max:" with the annotations price, min, and max; panel (b)] 

Fig. 8: Label Annotation Examples 

Consider the form labeling of Figure 7 under a schema 
with $B \prec A$. Labels are denoted with triangles, fields 
with diamonds, segments with circles. Labels are further 
annotated with matching annotation types (here always only 
one), with value labels drawn as outlines only. Then, X@A{} 
matches 3, 4; X@A{e,d} matches 4, but not 3 as 3 has more 
labels of B than of A and the exclusive modifier e is present; 
X@A{e,p} matches 3, but not 4 as the proper modifier p 
prevents the value labels in white from being considered. The 
latter matches 3 despite the presence of e, as we consider also 
the labels of the parent of 3 (since the direct modifier d is 
absent) and thus there are two A labels. 

Figure 8 shows a real-life example with the annotations 
produced by a typical set of annotators. In Figure 8a there are 
two text inputs for min and max price. However, the two labels 
"min" and "max" are the only directly associated text boxes 
and do not carry any information that indicates that these 
fields are about prices. This is available only when 
considering the segment (and thus indirect) label "Price:". 
Thus, X@price{d} returns the empty set, but X@price{} returns 
the two fields. In Figure 8b the drop-down menu for result 
ordering receives two price annotations, two bedroom 
annotations, and five order-by annotations. With 
$\text{order-by} \prec \text{price}$, X@price{e} 
returns the empty set, as the price annotations are "blocked" 
by the order-by annotations. 
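
The effect of the modifiers can be made concrete with a small sketch that follows the intuitive description of annotation queries. The data layout (fields as dictionaries with `labels` and `parent_labels`, labels as sets of proper and value annotations), the function names, and the concrete option annotations chosen for the drop-down are assumptions for illustration only.

```python
def label(proper=(), value=()):
    """A text label with the annotation types matched as proper / value label."""
    return {"proper": set(proper), "value": set(value)}


def matches(field, a, mods="", types=(), precedes=()):
    """Evaluate X@a{mods} for one field.
    precedes holds pairs (t, a) meaning that type t has precedence over a."""
    allowed = list(field["labels"])
    if "d" not in mods:                       # also use the segment's labels
        allowed += field.get("parent_labels", [])

    def matching(t):
        return [l for l in allowed
                if t in l["proper"] or ("p" not in mods and t in l["value"])]

    if not matching(a):
        return False
    if "m" in mods:                           # maximal: every other type counts
        rivals = [t for t in types if t != a]
    elif "e" in mods:                         # exclusive: only preceding types
        rivals = [t for t in types if (t, a) in precedes]
    else:
        rivals = []
    return all(len(matching(t)) <= len(matching(a)) for t in rivals)


# Figure 8a: "Min:" labels the field directly, "Price:" labels its segment.
min_field = {"labels": [label(proper={"min"})],
             "parent_labels": [label(proper={"price"})]}

# Figure 8b (option texts are hypothetical): five order-by, two price,
# and two bedroom annotations on the values of one drop-down.
order_field = {"labels": [label(value={"order_by", "price"}),
                          label(value={"order_by", "price"}),
                          label(value={"order_by", "bedroom"}),
                          label(value={"order_by", "bedroom"}),
                          label(value={"order_by"})]}
TYPES = ["price", "bedroom", "order_by"]
PRECEDES = {("order_by", "price")}

print(matches(min_field, "price", "d"))                        # direct only
print(matches(min_field, "price", ""))                         # with segment label
print(matches(order_field, "price", "e", TYPES, PRECEDES))     # blocked
print(matches(order_field, "order_by", "m", TYPES, PRECEDES))  # maximal
```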

OPAL-TL templates. OPAL-TL is a Datalog-based language 
for the definition of reusable templates of domain concepts. 
Examples of such templates are basic classification rules 
deriving a domain type from a conjunction of annotation 
types or min-max range templates where we look for multi- 
ple fields with related annotations in a group and some clue 
that they represent a range. In general, there are two types 
of such templates, one for classification constraints, one for 
structural constraints. The former specify relationships be- 
tween domain and annotation types, the latter the abstract 
structure of domain concepts. 

Definition 13 An OPAL-TL template is an expression of 
the form: TEMPLATE N<$T_1, \ldots, T_k$> { $p_1 \Leftarrow expr_1$ ... } 
where N is the template name, $T_1, \ldots, T_k$ are template 
variables, $p_i$ is a template atom, $expr_i$ a boolean formula 
over template atoms and annotation queries. A template atom 
p<t>(s) consists of a first-order predicate p, a sequence of 
terms $t = t_1, \ldots, t_n$ (where $t_i$ is either a constant 
or a template variable), and a sequence of terms 
$s = s_1, \ldots, s_m$ where each $s_i$ is either a constant or 
a first-order variable. Template and first-order variables 
constitute two disjoint sets. Note that, if t is empty, then a 
template atom is a normal first-order atom. Moreover, when all 
terms t are constants, we say that the template atom is 
template-ground. 

Multiple rules with the same head can be used to express 
disjunction of their bodies. For convenience, we use $\vee$ 
and $\neg$ over conjunctions, which are translated to 
Datalog$^\neg$ as usual. 

As an example, the following template defines a family 
of constraints that associate the concept (domain type) C to 
a node whenever it is labeled by an exclusive, direct, and 
proper annotation of type A. 

TEMPLATE basic_concept<C, A> { concept<C>(N) ⇐ N@A{d,e,p} } 

An instantiation of a template tpl produces a set of rules 
where the template variables $T_1, \ldots, T_k$ are assigned to 
values $v_1, \ldots, v_k$ defined by a template instantiation 
expression of the form: 

INSTANTIATE tpl<$T_1, \ldots, T_k$> using {<$v_1^1, \ldots, v_k^1$> ... <$v_1^n, \ldots, v_k^n$>} 

For example, the following expression instantiates 
basic_concept replacing C with type RADIUS and A with 
annotation type radius 

INSTANTIATE basic_concept<C, A> using {<RADIUS, radius>} 

and produces the following instantiated rule: 

concept<RADIUS>(N) ⇐ N@radius{d,e,p} 

⟨program⟩  ::= (⟨template⟩ | ⟨inst⟩ | ⟨trule⟩)+ 
⟨template⟩ ::= 'TEMPLATE' ⟨id⟩ '<' ⟨tvar⟩+ '>' '{' ⟨trule⟩+ '}' 
⟨inst⟩     ::= 'INSTANTIATE' ⟨id⟩ '<' ⟨tvar⟩+ '>' 
               'using' '{' ('<' ⟨const⟩+ '>')+ '}' 
⟨trule⟩    ::= ⟨tatom⟩ '⇐' ⟨tbody⟩ | ⟨inst⟩ 
⟨tbody⟩    ::= ⟨texpr⟩ (',' ⟨texpr⟩)* 
⟨texpr⟩    ::= ⟨atom⟩ | ⟨annot⟩ | ⟨tatom⟩ | ⟨neg⟩ | ⟨disj⟩ 
⟨annot⟩    ::= ⟨var⟩ '@' '{' ('d' | 'e' | 'p' | 'm')* '}' 
⟨tatom⟩    ::= ⟨id⟩ '<' (⟨tvar⟩ | ⟨const⟩)+ '>' '(' ⟨par⟩* ')' 
             | '<' ⟨tvar⟩ '>' '(' ⟨par⟩* ')' 
⟨par⟩      ::= ⟨var⟩ | ⟨tvar⟩ | ⟨const⟩ 
⟨const⟩    ::= ⟨type-id⟩ | ⟨annot-id⟩ | ⟨tag⟩ | ⟨string⟩ | ⟨id⟩ 
⟨neg⟩      ::= '¬' '(' ⟨tbody⟩ ')' 
⟨disj⟩     ::= '(' ⟨tbody⟩ '∨' ⟨tbody⟩ ')' 

Fig. 9: OPAL-TL syntax 

The full syntax of OPAL-TL is given in Figure 9 (with 
⟨string⟩, ⟨id⟩, and ⟨var⟩ as in Datalog and ⟨tvar⟩, ⟨type-id⟩, 
⟨annot-id⟩, ⟨tag⟩ template variables, domain types, 
annotation types, and HTML tags, respectively). 

The semantics of OPAL-TL is given by rewriting any set 
of templates $\Sigma_T$ into Datalog$^\neg$ programs, using 
assignments of template variables to constants specified by the 
instantiation rules, and by considering every template-ground 
predicate name as a new first-order predicate. Due to possible 
occurrences of INSTANTIATE within templates, the instantiation 
must be repeated until there are no applicable instantiate 
rules. To ensure termination of the instantiation procedure, 
we do not allow recursive template instantiations. Properties 
such as safety can be easily extended from Datalog$^\neg$ to 
OPAL-TL: 

Definition 14 An OPAL-TL template is safe, if every template 
variable that occurs in the body also occurs in the head 
of the template and every rule is safe, i.e., all first-order 
variables that occur in the head or in a negative atom in the 
body, also occur in a positive atom in the body. 

Proposition 1 Let $\Sigma_T$ be a set of safe OPAL-TL 
templates, and let S be an assignment specified by OPAL-TL 
instantiation rules, then any instantiation $t(\Sigma_T, S)$ is 
a safe Datalog$^\neg$ program. 

In contrast to safety, stratification depends also on the 
instantiation and is therefore defined over the expanded pro- 
gram as usual. 
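
The rewriting of a template into plain rules amounts to substituting template variables by constants. The following sketch shows this for templates without nested INSTANTIATE expressions; it treats rules as strings, uses the ASCII arrow `<=` in place of the rule arrow, and the function name `instantiate` is hypothetical. It is an illustration of the idea, not the OPAL-TL rewriting procedure.

```python
import re


def instantiate(rules, tvars, assignments):
    """Replace template variables in each rule by the constants of every tuple.

    rules:       rule strings of one template, e.g. "concept<C>(N) <= N@A{d,e,p}"
    tvars:       template variable names, e.g. ["C", "A"]
    assignments: tuples of constants, one tuple per instantiation
    """
    # One pass with a single alternation avoids re-substituting inside
    # constants that were just inserted.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, tvars)) + r")\b")
    out = []
    for values in assignments:
        if len(values) != len(tvars):
            raise ValueError("wrong number of constants for template variables")
        binding = dict(zip(tvars, values))
        for rule in rules:
            out.append(pattern.sub(lambda m: binding[m.group(1)], rule))
    return out


basic_concept = ["concept<C>(N) <= N@A{d,e,p}"]
rules = instantiate(basic_concept, ["C", "A"],
                    [("RADIUS", "radius"), ("PRICE", "price")])
for r in rules:
    print(r)
```

Each tuple in the `using` part yields one copy of the template's rules, and every template-ground predicate such as `concept<RADIUS>` can then be read as an ordinary first-order predicate name.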

A natural question is now the complexity of computing 
the form model using OPAL-TL. This is related to the com- 
plexity of fact inference in OPAL-TL. 

Proposition 2 Fact inference in OPAL-TL is PTIME-complete 
in data complexity (when $\Sigma_T$ and S are fixed) and 
EXPTIME-complete in combined complexity. 

 1  TEMPLATE concept_by_proper<C, A> { concept<C>(N) ⇐ N@A{d,e,p} } 
 2 
 3  TEMPLATE concept_by_segment<C, A> { concept<C>(N) ⇐ N@A{e,p} } 
 4 
 5  TEMPLATE concept_by_value<C, A> { concept<C>(N) ⇐ N@A{m}, 
 6      ¬(A1 ≠ A, N@A1{d,e,p} ∨ N@A1{e,p}) } 
 7 
 8  TEMPLATE concept_minmax<C, CM, A> { 
 9    concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2), 
10      N1@A{e,d}, (concept<C>(N2) ∨ N2@A{e,d}) 
11    concept<CM>(N2) ⇐ child(N1,G), child(N2,G), follows(N2,N1), 
12      concept<C>(N1), N2@range_connector{e,d}, ¬(A1 ≺ A, N2@A1{d}) 
13    concept<CM>(N1) ⇐ child(N1,G), child(N2,G), adjacent(N1,N2), 
14      N1@A{e,p}, N2@A{e,p}, ((N1@min{e,p}, N2@max{e,p}) 
15      ∨ (N1@max{e,p}, N2@min{e,p})) } 

Fig. 10: OPAL-TL classification templates 



Proof Consider a set of template atoms D, a set of OPAL-TL 
templates $\Sigma_T$ over a set of template predicates 
$\Pi_T$ of arity at most k, and an assignment S of template 
variables to constants in a set $\Gamma_T$ specified by 
OPAL-TL instantiation rules. The fact inference problem for D, 
$\Sigma_T$, and S is to decide whether 
$D \cup t(\Sigma_T, S) \models a$, where a is a template atom. 
According to Proposition 1, the problem can be reduced to 
fact inference in Datalog$^\neg$, i.e., deciding whether 
$D \cup \Sigma_D \models a$ where $\Sigma_D = t(\Sigma_T, S)$ 
is the rewritten program. Clearly, the data complexity is 
PTIME-complete as for Datalog$^\neg$. Regarding the combined 
complexity, recall that fact inference for a Datalog$^\neg$ 
program $\Sigma_D$ and a set of atoms D is EXPTIME-complete 
since the maximum number of atoms that can be inferred is 
$|\Pi_D| \cdot |dom(D)|^w$ where $\Pi_D$ is the set of 
predicates of $\Sigma_D$, dom(D) is the domain of D and w is 
the maximum arity of predicates in $\Pi_D$. The rewriting 
$t(\Sigma_T, S)$ can generate at most 
$|\Pi_T| \cdot |\Gamma_T|^k$ template-ground atoms 
that contribute to the signature of $\Sigma_D$. Therefore, the 
number of atoms that can be generated is the product of two 
exponential terms, which is still exponential. The claim 
follows. 



4.2 Classification 

Classification is based on the classification constraints of 
the domain schema. In OPAL these constraints are speci- 
fied using OPAL-TL to enable reuse of domain concepts and 
templates. For instance, in the real estate and used car do- 
mains, we identify four templates that suffice to describe 
nearly all classification constraints. These templates effec- 
tively capture very common semantic entities in forms and 
are parametrized using domain knowledge. The building 
blocks are a domain type (or concept) C and an annotation 
type A that is used to define a classification constraint for C. 
None of these templates uses more than one annotation type 
as template parameter, though many query additional (but 
fixed) annotation types in their bodies. 



Figure 10 shows the classification templates for real 
estate and used car: (1) Concept by proper label. The first 
template captures direct classification of a node N with type 
C, if N matches X@A{d,e,p}, i.e., has more proper labels of 
type A than of any other type A' with $A' \prec A$. This 
template is used by far most frequently, primarily for 
concepts with unambiguous proper labels. (2) Concept by 
segment label. The second template relaxes the requirement by 
considering also indirect labels (i.e., labels of the parent 
segment). In the real estate and used car domains, this 
template is instantiated primarily for control fields such as 
order_by or display_method (grid, list, map) where the 
possible values of the field are often misleading (e.g., an 
order_by field may contain "price", "location", etc. as 
values). (3) Concept by value label. The third template also 
considers value labels, but only if neither the first nor the 
second template can match. In that case, we infer that a field 
has type C, if the majority of its direct or indirect, value 
or proper labels are annotated with A. 
(4) Min-max concept. Web forms often show pairs of fields 
representing min-max values for a feature (e.g., the number 
of bedrooms of a property). We specify this template with 
three simple rules (lines 9-15) that describe three 
configurations of segments with fields associated with value 
labels only (proper labels are captured by the first two 
templates). It is the only template with two concept template 
parameters, C and $C_M$, where $C_M \sqsubset C$ is the 
"minmax" variant of C. The first rule locates adjacent pairs 
of such nodes, or a single such node and one that is already 
classified as C. The second rule locates nodes where the 
second follows directly the first (already classified with C), 
has a range_connector (e.g., "from" or "to"), and is not 
annotated with an annotation type with precedence over A. The 
last rule also locates adjacent pairs of such nodes and 
classifies them with $C_M$ if they carry a combination of min 
and max annotations. 

In addition to these templates, there is also a small num- 
ber of specific rules. In the real estate domain, e.g., we use 
the following rule to describe forms that use links (a ele- 
ments) for submission (rather than submit buttons). Identi- 
fying such a link (without probing and analysis of Javascript 
event handlers) is performed based on an annotation type 
for typical content, title (i.e., tooltip), or alt attribute of 
contained images. This is mostly, but not entirely domain 
independent (e.g., in real estate a "rent" link). 



concept<LINK_BUTTON>(N1) ⇐ form(F), descendant(N1,F), link(N1), 
    N1@LINK_BUTTON{d}, ¬(descendant(N2,F), 
    (concept<BUTTON>(N2) ∨ follows(N1,N2))) 



4.3 Model Repair 

With fields and segments classified, OPAL verifies and 
repairs the structure of the form according to structural 
constraints on the segments, such that it fits the domain 
schema. As for classification constraints, we use OPAL-TL 
to specify the structural constraints. The actual verification 
and repair is also implemented in OPAL-TL, but since it is 
not domain independent, it is not exposed to the user for 
modification. Here, we first introduce typical structural 
constraints and their templates and then outline the model 
repair algorithm, but omit the OPAL-TL rules. 

 1  TEMPLATE segment<C> { 
 2    segment<C>(G) ⇐ outlier<C>(G), child(N1,G), ¬(child(N2,G), 
 3      ¬(C1 ⊑ C, concept<C1>(N2) ∨ segment<C1>(N2))) } 
 4 
 5  TEMPLATE segment_range<C, CM> { 
 6    segment<C>(G) ⇐ outlier<C>(G), concept<CM>(N1), 
 7      concept<CM>(N2), N1 ≠ N2, child(N1,G), child(N2,G) } 
 8 
 9  TEMPLATE segment_with_unique<C, U> { 
10    segment<C>(G) ⇐ outlier<C>(G), child(N1,G), concept<U>(N1,G), 
11      ¬(C1 ⊑ C, child(N2,G), N1 ≠ N2, 
12      ¬(concept<C1>(N2) ∨ segment<C1>(N2))) } 
13 
14  TEMPLATE outlier<C> { 
15    outlier<C>(G) ⇐ child(G,P), child(G',P), ¬(segment<C>(G')) } 

Fig. 11: OPAL-TL structural constraints 

Structural constraints. The structural constraints and 
templates in the real estate and used car domains are shown in 
Figure 11 (omitting only the instantiation as in the 
classification case). All segment templates require that there 
is an outlier among the siblings of the segment: 
outlier<C>(G) holds if at least one of G's siblings is not a C 
segment. (1) Basic segment. A segment is a C segment, if its 
children are only other segments or concepts typed with C. 
This is the dominant segmentation rule, used, e.g., for ROOM, 
PRICE, or PROPERTY_TYPE in the real estate domain. (2) Minmax 
segment. A segment is a C segment, if it has at least two 
field children typed with $C_M$ where $C_M \sqsubset C$ is the 
minmax type for C. This is used, e.g., for price and bedroom 
range segments. (3) Segment with mandatory unique. A segment 
is a C segment, if its children are only segments or concepts 
typed with C except for one (mandatory) field child typed with 
U where $U \not\sqsubseteq C$. This is used, e.g., for 
geography segments where only one RADIUS may occur. 

Repairing form interpretations. The classification yields a 
form interpretation that is, however, not necessarily a 
model under $\Sigma$ and may contain violations of structural 
constraints. We adapt the types of fields and segments and the 
segment hierarchy of F with the rewriting rules described 
below to construct a form model compliant with $\Sigma$. OPAL 
performs the rewriting in a stratified manner to guarantee 
termination and introduces at most n new segments where n 
is the number of fields in the form. 



(1) Under Segmentation: If there is a segment n with
type t such that $C_T(t)$ requires additional child segments of
type $t_1, \ldots, t_k \notin \text{child-}T(n)$, we try to partition
the children of n into $k+1$ partitions $P_1, \ldots, P_k, P_n$ such
that $P_i \models C_T(t_i)$ and
$P_n \cup \{t_1, \ldots, t_k\} \models C_T(t)$. For each $P_i$ we add a
new segment node as child of n, classify it with $t_i$, and move
all nodes assigned to $P_i$ from n to that segment. If there
is a segment n without type or with type t, but for which
$\text{child-}T(n) \not\models C_T(t)$ and the above case cannot be
applied, then that segment may be split: If there are
non-overlapping subsequences $C_i$ of children of n, such that all
children of n are covered and, for each $C_i$, there is a type $t_i$
such that the types of $C_i$ satisfy the constraints for $t_i$, then
we replace n with a sequence of segments, one for each $C_i$, typed
with $t_i$. In practice, few cases of multiple under segmentations
occur at the same node and we can limit the search space using a
total order on T. We observe that the number of segments
is bounded by the number of fields in the form and provide
a pool of unused segments in the segmentation. This avoids
the need for value invention in the model repair.

(2) Over Segmentation: If there is a segment n of
type t with children $c_1, \ldots, c_k$ such that
$\bigcup_i \text{child-}T(c_i) \cup \bigcup_{n' \in C} t(n') \models C_T(t)$,
where C is the set of children of n without $c_1, \ldots, c_k$, then
we move the children of each $c_i$ to n and delete all $c_i$.

(3) Under Classification: If there is a segment n of type
t with untyped children $c_1, \ldots, c_k$ and corresponding types
$t_1, \ldots, t_k$ such that
$\text{child-}T(n) \cup \{t_1, \ldots, t_k\} \models C_T(t)$ and, for
each $c_i$, $\text{child-}T(c_i) \models C_T(t_i)$ holds, then we type
$c_i$ with $t_i$.

(4) Over Classification: If there is a segment node n
of type t with child c typed $t_1$ and $t_2$ such that
$\{t_1\} \cup \bigcup_{c' \in C} t(c') \models C_T(t)$, where C is the
set of children of n without c, we drop $t_2$ from $t(c)$.

(5) Miss Classification: If there is a node n of type t
where $\text{child-}T(n) \not\models C_T(t)$, then we delete the
classification of n as t.
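
As an illustration of the simplest of these rewriting rules, the following sketch implements the over segmentation case for a basic segment constraint only. It is a hypothetical simplification: the `Node` class and the check "all children typed with c" stand in for the general constraint test, which is not spelled out here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Hypothetical node of a form interpretation: a field or a segment."""
    kind: str                                   # "field" or "segment"
    types: Set[str] = field(default_factory=set)
    children: List["Node"] = field(default_factory=list)


def repair_over_segmentation(parent: Node, c: str) -> bool:
    """Flatten child segments typed with c into parent if the flattened
    children still satisfy the (assumed) basic constraint for c."""
    redundant = [ch for ch in parent.children
                 if ch.kind == "segment" and c in ch.types]
    if not redundant:
        return False
    flattened: List[Node] = []
    for ch in parent.children:
        if any(ch is r for r in redundant):
            flattened.extend(ch.children)       # move grandchildren up
        else:
            flattened.append(ch)
    if not all(c in ch.types for ch in flattened):
        return False                            # constraint violated: keep as is
    parent.children = flattened                 # the redundant segments are dropped
    return True
```

Applied to a property type segment that was split into two sub-segments of three check boxes each, the function leaves a single segment with six field children.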



Figure 12 shows the segmentation and classification
OPAL obtains for this form before model repair. There are
several problems with this segmentation:

(1) The min_price and max_price fields are not arranged
into a range segment as no such node is present in the
DOM. This is a case of under segmentation. Following the
segment_range constraint, OPAL introduces a price range
segment to include both fields as in Figure 13a.



(2) The four radio buttons under "order by" are of two
different domain types, i.e., order_by for the first two and
display for the last two. Due to concept_by_segment from
Figure 10 and the segment label "order by", the last two would
also get classified as order_by, if not for display ≺ order_by.
This is an example of the second case of under segmentation,
where OPAL needs to split the existing segment as
it is not supported by a structural constraint, but there are
subsequences of children that can form valid segments
(Figure 13b).

Fig. 12: Farlowestates before model repair

(c) property type

Fig. 13: Model Repair on Farlowestates Real Estate Form



(3) As a result of the original segment with four radio
buttons grouped together, the last two radio buttons in the
four are also typed as order_by in addition to their display
type. OPAL resolves this over classification by removing the
order_by type following the restructuring of the segment.

(4) The PROPERTY_TYPE segment is subdivided into two
segments in the original segmentation, since OPAL identifies
no style-equivalence among the six check boxes due to lack
of similarity. However, two segments of property_type cannot
be contained in a single parent segment (due to outlier).
Thus, the two segments are removed with all their children
directly contained in the larger segment (Figure 13c). This
is an example of over segmentation.

(5) The segmentation obtained at segment scope pre- 
serves the two DOM nodes representing two form rows. 
However, in the domain schema, these nodes do not carry 
meaning, and thus are treated as over segmentation and re- 
moved. 



4.4 Domain Instantiation: Methodology and Example 

In this section, we demonstrate how to derive an OPAL do- 
main schema, which includes form specific concepts, from a 
given standard ontology of a domain. This is the typical way 
to instantiate a domain for use with OPAL. 



Figure 14 shows a simple ontology for the used car
domain (in the UK). Note that most search forms are about
searching for entities (double border in Figure 14) by their
properties (single border) such as price or mileage of a car.
Therefore, most of the types in an OPAL domain schema cor- 
respond to such properties of entities in the domain. 

We observe that properties can be roughly distinguished 
into numerical, categorical, and free text according to their 
range and that these distinctions dictate to a large extent the 
expected form fields for searching by those properties. For a 
numerical property we expect, e.g., either a single text input 
or slider, two min-max fields for entering a range, or a set of 
checkboxes to select common values or ranges. Categorical 
properties, on the other hand, never exhibit range inputs. 

These observations are codified in the derivation
templates of Figure 15. These templates group typical
instantiations for the above kinds of properties as well as for
compound object types such as location in Figure 14.

(1) For an object type (engine), we instantiate only the 
segment<c> template, i.e., we allow segments, but not fields 
of this type. Such segments typically collect multiple prop- 
erties of the object type, e.g., engine_size and fuel_type. 

(2) For a free text type (e.g., address), we instantiate
only the concept_by_proper<C,A> and concept_by_value<C,A>
templates, which allow fields, but not segments, of that type.
There is usually no need for a segment in this case, as there 
are rarely multiple occurrences of fields for such a type. In 
the rare case where that is nevertheless possible, we instan- 
tiate segment<c> separately. 

(3) For a categorical type (make or colour), we instantiate,
in addition to concept_by_proper<C,A>, also segment<C> and
concept_by_segment<C,A>. Categorical types are often
represented as single select boxes or lists of radio buttons or
check boxes. For the latter, an enclosing segment is desirable
and concept_by_segment<C,A> allows us to propagate the
segment labels to the fields.

Fig. 14: Used car ontology

TEMPLATE object_type<C> {
  INSTANTIATE segment<C> using { <C> } }

TEMPLATE free_text_type<C, A> {
  INSTANTIATE concept_by_proper<C, A> using { <C,A> }
  INSTANTIATE concept_by_value<C, A> using { <C,A> } }

TEMPLATE categorical_type<C, A> {
  INSTANTIATE concept_by_proper<C, A> using { <C,A> }
  INSTANTIATE concept_by_segment<C, A> using { <C,A> }
  INSTANTIATE concept_by_value<C, A> using { <C,A> }
  INSTANTIATE segment<C> using { <C> } }

TEMPLATE numeric_type<C, Cm, A> {
  INSTANTIATE concept_by_proper<C, A> using { <C,A> }
  INSTANTIATE concept_by_segment<C, A> using { <C,A> }
  INSTANTIATE concept_by_value<C, A> using { <C,A> }
  INSTANTIATE concept_minmax<C, Cm, A> using { <C,Cm,A> }
  INSTANTIATE segment<C> using { <C> }
  INSTANTIATE segment_range<C, Cm> using { <C,Cm> } }

Fig. 15: Template for different property kinds

(4) For a numerical type (price or seats), we also
instantiate the segment_range and concept_minmax templates,
enabling the classification of range segments and fields.
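
The grouping of templates by property kind can be summarised as a small lookup table. The sketch below mirrors the template names of Figure 15; the Python representation itself (the dictionary `TEMPLATES_BY_KIND` and the helper `templates_for`) is purely illustrative and not part of OPAL-TL.

```python
from typing import Dict, List

# Templates instantiated per property kind, following Figure 15.
TEMPLATES_BY_KIND: Dict[str, List[str]] = {
    "object": ["segment"],
    "free_text": ["concept_by_proper", "concept_by_value"],
    "categorical": [
        "concept_by_proper", "concept_by_segment",
        "concept_by_value", "segment",
    ],
    "numeric": [
        "concept_by_proper", "concept_by_segment", "concept_by_value",
        "concept_minmax", "segment", "segment_range",
    ],
}


def templates_for(kind: str) -> List[str]:
    """Return the templates to instantiate for a property of the given kind."""
    if kind not in TEMPLATES_BY_KIND:
        raise ValueError(f"unknown property kind: {kind}")
    return list(TEMPLATES_BY_KIND[kind])
```

Only numeric kinds receive the range-related templates, which matches the observation that categorical properties never exhibit range inputs.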

With these templates, we can derive an OPAL annotation
and domain schema very quickly from a given domain
ontology such as the one in Figure 14.



First, we normalize the ontology: If a class C has sub- 
classes without additional properties (type classes), we gen- 
erate a new categorical property C_type, add all labels from 
the sub-classes to that property, and remove the sub-classes. 
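
A minimal sketch of this normalisation step is given below. The dictionary layout of the ontology (keys `labels`, `properties`, `subclasses`) is an assumption made for illustration, and a "type class" is taken to be a sub-class with no properties and no sub-classes of its own.

```python
from typing import Any, Dict

Ontology = Dict[str, Dict[str, Any]]


def normalize(classes: Ontology) -> Ontology:
    """Replace type classes by a categorical property C_type on their parent C.

    Assumed layout: name -> {"labels": [...], "properties": {...},
    "subclasses": [...]}. The input is left unmodified.
    """
    result = {
        name: dict(cls, properties=dict(cls["properties"]),
                   subclasses=list(cls["subclasses"]))
        for name, cls in classes.items()
    }
    for name, cls in classes.items():
        type_subs = [
            s for s in cls["subclasses"]
            if not classes[s]["properties"] and not classes[s]["subclasses"]
        ]
        if not type_subs:
            continue
        # Collect the labels of all type classes into the new property.
        labels = [label for s in type_subs for label in classes[s]["labels"]]
        result[name]["properties"][f"{name}_type"] = {
            "kind": "categorical", "labels": labels,
        }
        result[name]["subclasses"] = [
            s for s in cls["subclasses"] if s not in type_subs
        ]
        for s in type_subs:
            result.pop(s, None)                 # remove the sub-classes
    return result
```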

Second, we derive the annotation schema and, in partic- 
ular, the necessary annotators as follows: 

(1) For each concept or property c of the ontology, we
create an annotation type c. All labels of c, possibly enriched
with synonyms from an external knowledge base such as
WordNet, form an annotator for the proper labels of the
concept ($isLabel_c$).



(2) For categorical concepts or properties, we require
a given list of instances, an existing annotator, or another
entity recogniser, again possibly provided by an external
knowledge base such as DBPedia or LinkedGeoData.
Numerical values are treated similarly, though these often take
simply the form of a number in a certain range. This provides
$isValue_c$.

Third, we derive the domain schema in four steps: 

(1) For each class C, add an instantiation rule for 
object_type<c>. In our example, this yields 6 instantiations 
(recall, that type classes are normalised to properties above). 

(2) For each property, add an instantiation rule of
corresponding type, e.g.,

INSTANTIATE numeric_type<C, Cm, A> using { <PRICE, PRICEm, price> }

In our example, this yields 22 instantiations (20 properties
from Figure 14 and two ..._type properties).

(3) Determine which "presentational" fields and seg- 
ments occur in the given domain and add them to the do- 
main schema. A field or segment is presentational, if it de- 
termines the way the results are represented. In the used car 
and real estate domains, we identify two types of presen- 
tational fields: "order-by" and "pagination" which control 
the order in which the results are presented as well as the 
number of results per page. These presentational types are 
mostly shared between domains and can be easily reused 
thanks to OPAL-TL templates: 



INSTANTIATE categorical_type<C, A> using
  { <ORDER_BY, order_by> <PAGINATION, pagination> }



In this step, we also add generic rules that are independent of 
the domain, e.g., for the form itself and domain-independent 
form facilities such as submit buttons or generic keyword 
search fields. 

(4) Sometimes small manual adjustments are necessary. 
For example, numerical types may occur with multiple units 
of measure or other modifiers, e.g., prices with different 
currencies or locations with a search radius. Such modifier 
fields are usually unique in their corresponding segment and 
thus added using the segment_with_unique<C,U> template. In
the used car domain, we can observe this for currency and
radius:

INSTANTIATE TEMPLATE segment_with_unique<C, U> using
  { <PRICE, CURRENCY> <LOCATION, RADIUS> }
INSTANTIATE TEMPLATE concept_by_proper<C, A> using
  { <CURRENCY, currency>, <RADIUS, radius> }
INSTANTIATE TEMPLATE concept_by_value<C, A> using
  { <CURRENCY, currency>, <RADIUS, radius> }

Fig. 16: Used car: classified form

Some object types, in particular location, may also be entered
as a whole through free text fields and we accordingly
instantiate the free_text_type template for them:

INSTANTIATE TEMPLATE free_text_type<C, A> using
  { <LOCATION, location> }



Finally, we need to determine part-of and precedence
between types. The part-of relation is derived from the
associations of the domain schema, e.g., ADDRESS -o LOCATION,
POSTCODE -o LOCATION, FUEL_TYPE -o ENGINE for our case.
Precedence requires some observation of cases where
annotations for different types overlap. Typically, we want to
give presentational types precedence over all domain types
(as they often contain values such as "sort by price").
For the used car domain, we observe that pagination ≺
order_by and that both have precedence over all domain
types. We also observe that mileage and radius (in locations)
can have overlapping values. Though radius is only used in
segment_with_unique<C,U> for location segments, which
disallow mileage elements, we add mileage ≺ radius to express a
bias for mileage.
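
The effect of precedence on overlapping annotations can be illustrated with a small filter. This is a sketch under assumptions: precedence is given as a set of pairs (preferred, dominated) that is already transitively closed, and the function name `apply_precedence` is hypothetical.

```python
from typing import Set, Tuple

Precedence = Set[Tuple[str, str]]   # (preferred type, dominated type)


def apply_precedence(candidates: Set[str], precedence: Precedence) -> Set[str]:
    """Drop every candidate type dominated by another candidate type."""
    return {
        t for t in candidates
        if not any((other, t) in precedence
                   for other in candidates if other != t)
    }


# Example relation for the used car domain as described above
# (only an excerpt of the domain types is listed).
USED_CAR_PRECEDENCE: Precedence = {
    ("pagination", "order_by"),
    ("pagination", "price"), ("order_by", "price"),
    ("pagination", "mileage"), ("order_by", "mileage"),
    ("mileage", "radius"),
}
```

A field annotated with both order_by and price (e.g., the option "sort by price") is thus classified as order_by only.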



Figure 16 shows a form from the used car domain fully 
classified according to this domain schema. 

5 Light-weight Form Integration 

OPAL's form models allow the easy implementation of many
types of applications that require automatic understanding
and interaction with forms, such as form integration and
filling, data extraction, or web automation. As discussed in
Section 2, we focus here on form integration (or filling), i.e., the
part of a web integration system [14] that translates a query
on the global schema (OPAL's domain schema) to a query
against concrete forms. In this section, we introduce a light- 
weight form integration system that performs this task fully 
automatically for thousands of forms in a domain, given 
only an OPAL domain schema. We have instantiated this sys- 
tem for the real estate and used car domain, but OPAL is as 
easily applied to other domains, since only a very limited 
amount of additional customisation is needed (on type vari- 
ations and, possibly, similarities). 

Recall that we focus on the optimistic, single-query
variant of the form integration problem: We aim for a single
query that returns all results matching the global (or master)
query, but also allow it to return non-matching results if there
is no more specific query that returns all matching ones.

OPAL's form integration translates the master query into
concrete queries through a small set of translation rules
supported by a notion of similarity on property values. OPAL
can perform form integration without any other information
than what is provided by an OPAL domain schema and
corresponding form model. However, it can be further improved
by providing additional domain-specific information.

Similarity on values is represented as a real-valued
function on pairs of values and is based on the property
type: For free-text and categorical properties, OPAL uses a
mix of Levenshtein and longest common substring distance,
for numeric properties a difference-based similarity. A
domain schema can be enhanced by property-specific
similarity functions, e.g., to deal with different units of measure.
A small set of such functions is provided with OPAL: for price,
for distance properties, and for dates.
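
One possible realisation of such similarity functions is sketched below. How OPAL mixes the two string measures is not specified here, so the equal weighting, the normalisation by the longer string, and the form of the numeric similarity are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, cur[j])
        prev = cur
    return best


def text_similarity(a: str, b: str) -> float:
    """Equal-weight mix of both measures, normalised to [0, 1] (assumed)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    lev = 1.0 - levenshtein(a, b) / longest
    lcs = longest_common_substring(a, b) / longest
    return 0.5 * lev + 0.5 * lcs


def numeric_similarity(a: float, b: float) -> float:
    """Difference-based similarity: 1 for equal values, decreasing with |a - b|."""
    return 1.0 / (1.0 + abs(a - b))
```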

Translation rules use these similarity functions to
translate the constraints of the master query Q into queries on the
concrete forms. For each form F with form model M and
constraint $C \in Q$ on type T, we retrieve the fields
$f_1, \ldots, f_n$ classified with T. Let values(C) be the (possibly
infinite) set of values for which C holds.

(1) Single field, single value: If $n = 1$, values(C) = $\{v\}$, and
    (i) $f_1$ is a free text input, return $f_1 = v$.
    (ii) $f_1$ is a select box, return $f_1 = v'$ where $v'$ is the
        option of $f_1$ most similar to $v$.
(2) Multi field: If $n > 1$,
    (i) values(C) = $\{v\}$, and all $f_i$ are radio buttons
        (exclusive options), return $f_k$ = true for the $f_k$ that
        is most similar to $v$.
    (ii) values(C) = $\{v_1, \ldots, v_k\}$ and all $f_i$ are check boxes
        (non-exclusive options), return $f_j$ = true for each
        $f_j$ where a $v_i$ exists such that the similarity of $f_j$
        and $v_i$ is minimal among all such pairs.
    (iii) values(C) is a range from $v_1$ to $v_k$ and all $f_i$ are
        free-text range input fields (i.e., of type $T_m$, where
        $T_m$ is the minmax type to T), then return $f_{is} = v_1$
        for each $f_{is}$ that is a minimum input and $f_{ie} = v_k$
        for each $f_{ie}$ that is a maximum input.
    (iv) values(C) is a range from $v_1$ to $v_k$ and all $f_i$ are
        select-box range input fields, then return $f_{is} = v_1'$
        for each $f_{is}$ that is a minimum input, where $v_1'$ is
        the most similar option of $f_{is}$ to $v_1$ that is smaller
        or equal to $v_1$. Analogously for $f_{ie}$.

In all other cases (e.g., a select box for a set inclusion
constraint), we return no constraints to avoid false negatives.
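
Two of these rules, the single select box and the select-box range, can be sketched as follows. The sketch is illustrative only: `difflib.SequenceMatcher` from the Python standard library is swapped in as the string similarity, numeric options are compared by value, and returning `None` stands for "no constraint".

```python
import difflib
from typing import List, Optional, Tuple


def most_similar_option(options: List[str], value: str) -> str:
    """Rule (1)(ii): pick the select-box option most similar to value."""
    return max(
        options,
        key=lambda o: difflib.SequenceMatcher(
            None, o.lower(), value.lower()).ratio(),
    )


def translate_range_select(
    min_options: List[float], max_options: List[float],
    low: float, high: float,
) -> Tuple[Optional[float], Optional[float]]:
    """Rule (2)(iv): closest option not above `low` for the minimum input and,
    analogously, closest option not below `high` for the maximum input.

    None means no constraint is set for that input, so that no matching
    result is excluded.
    """
    below = [o for o in min_options if o <= low]
    above = [o for o in max_options if o >= high]
    return (max(below) if below else None,
            min(above) if above else None)
```

For a master query with a price range of 150,000 to 250,000 and price options in steps of 100,000, the translated range is 100,000 to 300,000, which may return non-matching results but misses none.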

In many domains, we can observe that the same
information is represented in alternative ways on different sites. E.g.,
the age of a car is represented by the manufacturing year on
some sites. Similarly, the location of a property may be given
as a street address, a postcode, or even just a town, in
particular for rural agencies. To treat these cases, we need to be
able to translate a constraint such as "age = 6" to a constraint
"year = 2006" or "postcode = OX1" to "town = Oxford".
We call AGE and YEAR type variants and amend the domain
schema with a value mapping for each pair of type variants.
Value mappings for numerical properties are typically sim- 
ple conversion functions, e.g., from different units of mea- 
sure. Value mappings for categorical properties are typically 
realised by a query to an external database or service such as 
DBPedia. In our example domains, we use value mappings 
for conversions of metric and imperial distances as well as 
of postcodes to towns and other locations. To treat type vari- 
ants we perform the following test and translation before the 
aforementioned translation rules: 

(0) Type variants. If $n = 0$ and there is a field $f'$ with type
    $T'$ such that $T'$ is a variant type of T, we translate the
    values in C to $T'$ and continue with that constraint.
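
The type variant test can be sketched as a lookup in a table of value mappings. All names below (`VALUE_MAPPINGS`, `translate_variant`, the type names) are hypothetical, and the reference year of 2012 is an assumption chosen only so that "age = 6" maps to "year = 2006" as in the example above.

```python
from typing import Callable, Dict, Optional, Set, Tuple

REFERENCE_YEAR = 2012  # assumed; makes age 6 correspond to year 2006


def age_to_year(age: float) -> float:
    return REFERENCE_YEAR - age


def miles_to_km(miles: float) -> float:
    return miles * 1.609344


# (source type, variant type) -> conversion function
VALUE_MAPPINGS: Dict[Tuple[str, str], Callable[[float], float]] = {
    ("age", "year"): age_to_year,
    ("distance_miles", "distance_km"): miles_to_km,
}


def translate_variant(
    constraint_type: str, value: float, field_types: Set[str],
) -> Optional[Tuple[str, float]]:
    """Rule (0): if no field carries the constraint's type, switch to a
    variant type present in the form and convert the value."""
    if constraint_type in field_types:
        return constraint_type, value           # n > 0: nothing to do
    for (source, target), convert in VALUE_MAPPINGS.items():
        if source == constraint_type and target in field_types:
            return target, convert(value)
    return None                                 # no variant available
```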

With those simple rules, OPAL's form integration
manages to translate most constraints as shown in Section 6.
There are, of course, still cases where the translation fails,
e.g., if categorical values are mapped to ranges by some
ordering such as road tax brackets or iPhone models (ordered
according to year of introduction). But as demonstrated in
Section 6, this light-weight simple form integration already
provides us with a successful translation of a master query
in the vast majority of cases.

To illustrate OPAL's form integration, we consider the
form of primelocation.com as shown in the middle of
Figure 17. The figure shows the OPAL testing tool that we use
to test and verify the accuracy of OPAL domain schemas.
It allows the user to visualize the form labels, form
segments, and classifications derived by OPAL and to track
down where, e.g., there are problems with the classification
constraints or the annotations. It also provides a master
query in the lower third. The concrete form is automatically
filled according to the values provided in the master form.
This allows the user to visually verify that the query has
been translated correctly. The master form is automatically
generated from the domain schema, but the user can provide
additional information on which fields to include. For space
reasons, we have focused in Figure 17 on the types most
commonly used in constraints in the UK real estate domain.

Fig. 17: OPAL Testing Tool

For the concrete form from primelocation.com, we
highlight form fields and labels by colouring them with the
same colour (here, e.g., the "minimum" and the first price
field). Form segments are shown as boxes with no filling
except for their labels (a price segment with "price range"
label). The figure shows the form after OPAL has filled it
according to the values from the master query. Notice how, for
the three select boxes for minimum and maximum price, as
well as bedroom number, OPAL picks the closest value to the
one specified in the master form.



6 Evaluation 

We perform experiments on several domains across four
different datasets. Two datasets are randomly sampled from the
UK real estate and UK used-car domains, respectively. We
compare with existing approaches via ICQ and TEL-8, two
public benchmark sets, on which we only evaluate OPAL's
form labeling. This limitation is necessary to allow a
comparison that is fair to existing approaches, which only label
forms and do not use domain knowledge. Even with this limitation,
however, OPAL outperforms previous approaches in most
domains by at least 5%. We also perform an introspective
analysis of OPAL to show (1) the impact of field, segment,
layout, and repair in the form interpretation, (2) OPAL's
performance and scalability with increasing page size, and
(3) the effectiveness of the form integration in OPAL.

Fig. 18: OPAL on 734 forms

We evaluate the proper assignment of text nodes to form
fields using standard notions of precision, recall and F-score
(harmonic mean $F = F_1 = 2PR/(P+R)$ of precision and
recall). For form labeling (classification), precision P is
measured as the proportion of correctly labeled (classified) fields
over total labeled fields, while recall R is the fraction of
correctly labeled fields over total number of fields. For form
filling, precision and recall do not apply and we therefore report
the error rate as portion of total fields that are not correctly
filled (i.e., either filled but with a wrong value or not filled at
all, despite a corresponding constraint in the master query).
For all considered datasets, we compare the extracted result
to a manually constructed gold standard. We evaluate
segmentation through its impact on classification, see
Figure 22, and the improved performance on the two datasets
where we perform form interpretation (UK real estate and
used car) versus the ICQ and TEL-8 datasets.
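
For concreteness, the measures defined above can be computed as in the following sketch. The function names are illustrative; the second function simply evaluates $F = 2PR/(P+R)$, e.g., on the used car figures reported in Section 6.1 (98.2% precision, 99.2% recall, 98.7% F-score).

```python
from typing import Tuple


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean F = 2PR / (P + R); 0 if both inputs are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(
    correct: int, labeled: int, total: int,
) -> Tuple[float, float, float]:
    """Precision = correct / labeled fields, recall = correct / all fields."""
    precision = correct / labeled if labeled else 0.0
    recall = correct / total if total else 0.0
    return precision, recall, f_score(precision, recall)


def error_rate(not_correctly_filled: int, total: int) -> float:
    """Form filling: share of fields filled wrongly or not filled at all."""
    return not_correctly_filled / total if total else 0.0
```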



Datasets. For the UK real estate domain, we build a dataset
by randomly selecting 100 real estate agents from the UK
yellow pages (yell.com). Similarly, we randomly pick 100
used-car dealers from the UK's largest aggregator website,
autotrader.co.uk. The forms in these two domains have
significantly different characteristics than the ones in ICQ
and TEL-8, mainly due to changes in web technology and
web design practices. The usage of CSS stylesheets for
layout and AJAX features are among the most relevant.



The ICQ and TEL-8 datasets cover several domains.
ICQ presents forms from five domains: air traveling, (used)
cars, books, jobs, (U.S.) real estate. There are 20 web pages
for each of the domains, but two of them are no longer
accessible and thus excluded from this evaluation. TEL-8, on the
other hand, contains forms from eight domains: books, car
rental, jobs, hotels, airlines, auto, movies and music records.
The dataset amounts to 477 forms, but only 436 of them are
accessible (even in the cached version).

6.1 Field Labeling 

In our first experiment we evaluate the accuracy of OPAL's
field labeling on all four datasets, but only in the UK real
estate and used car domains do we employ the form interpretation
to further improve the field labeling. Figure 18 shows the
results. The first two bars are for the random sample datasets.
For the real estate domain, OPAL classifies fields with perfect
precision and 98.6% recall. Overall we obtain a remarkable
99.2% F-score. The result is similar for the used car domain,
where OPAL obtains 98.2% precision and 99.2% recall, which
amount to 98.7% F-score. OPAL achieves lower precision
than recall in the used car domain due to the fact that web
forms in this domain are more interactive: certain fields are
enabled only when some other field is filled properly, yet
non-field placeholders are present in the HTML to indicate
that a field will appear when the other field is filled. This
introduces noise to field labeling and thus classification.

For the real estate domain, our domain schema consists 
of a few dozen field and segment types and about 40 annota- 
tion types. Similarly, in the used car domain, there are about 
30 annotation types. Creating an initial domain schema (in- 
cluding gazetteers and testing) takes a single person familiar 
with a domain and OPAL-TL roughly 1 week. 

The other two entries in Figure 18 regard field labeling
on the ICQ and TEL-8 datasets. On these, OPAL applies only
its domain-independent scopes (field, segment, layout) as no
domain schema is available for these domains. Nonetheless,
OPAL reports very high accuracy also on these forms,
confirming the effectiveness of our domain-independent
analysis. Not unexpectedly, OPAL performs better in the UK real
estate and used car domains where domain knowledge is
present, even though the forms in those datasets are on
average more modern and contain more fields (10.4 and 9.2
fields per form in the real-estate and used-car dataset versus
6.5 and 7.9 fields per form for ICQ and TEL-8).

Cross Domain Comparison. We use ICQ and TEL-8 to
compare field labeling in OPAL against existing approaches
on a wide set of domains. Figure 19a details the result of
OPAL on each domain of the ICQ dataset. It shows a perfect
F-score for the jobs domain (100%) and near-perfect values
for auto and air travelling (99.3% and 98.3%). For comparison,
[10] reports 92% F-score for labeling on ICQ on average, which
we outperform even in the domain most difficult for OPAL
(books). [32] reports slightly better precision and recall than
[10], but OPAL still outperforms it by several percentage points.

Fig. 19: OPAL on ICQ and TEL-8 benchmark

The results for the TEL-8 dataset are depicted in
Figure 19b. Here, the overall F-score is 96.3%, again mostly
affected by the performance in the books domain. Note that,
especially on TEL-8, OPAL obtains very high precision
compared to recall. Indeed, lower recall means OPAL is not able
to assign labels to all fields, missing some of them. For
comparison, [10] reports 88-90% overall F-score, which
we outperform by a wide margin. [23] reports F-scores
between 89% and 95% for four domains in the TEL-8 dataset.
Though they perform slightly better on books, we
significantly outperform them on the three other domains included
in their results, as well as on average.

In Section 4, we discuss that OPAL prioritises field over
segment over layout scope and we claim that this is due to
the decreasing precision. Table 1 shows the total number of
fields labeled in each scope, as well as the number and
percentage of false positives among those labels. It illustrates
that, indeed, the field scope produces almost no false
positives (2 out of 762 fields labeled in this scope, i.e., 0.3%),
the segment scope also produces very few (3 out of 154
labeled fields), and the layout scope produces most (8 out of
72 labeled fields).

Fig. 20: The Bookbeat form with source



What keeps OPAL from achieving 100% accuracy? Most
of the cases are due to OPAL's assumption that form labels
are separate text nodes. This is evidently the case in most
forms, as demonstrated by near perfect accuracy, but there
are some outliers that use image-only labels or merge
multiple labels into one node and use whitespace to achieve
the desired result. Figure 20, e.g., shows a form where
"Title" and "Keyword" are a single HTML node with &nbsp;
spaces in between. While both cases are easy enough to
address, they do require specific treatment and we omitted
them from the version of OPAL presented here to illustrate
that even without any such specifically tailored heuristics,
we can achieve nearly perfect form labeling and
interpretation.

Fig. 21: Classification error example



6.2 Form Interpretation 

The quality of OPAL's form interpretation depends on the
quality of the form labeling and that of the annotators. As
discussed above, for this evaluation we use annotators that
have been created in about 1 week for the UK real estate and
used car domains. The location related annotators are based
on standard sources (GeoNames and LinkedGeoData) and
thus have reasonable recall, but precision is fairly low, due to
the high number of locations in the UK that are homonyms
to common English words (e.g., the town of "Selling"). Such
noise in the value annotators, however, affects OPAL very
little, as the values of form fields are only used if the labels are
inconclusive and we only use the most frequent annotation
type. Noise in the label values is far more likely to lead to
classification errors. However, typical annotators are small
lists of 5-10 typical labels which are easy to create and
have very low noise. E.g., for bedroom labels we use just
"bedroom", "bed", and their plural forms, for make, model,
mileage and many more just "make", "model", "mileage",
and their plural forms, respectively.

With this, we achieve near perfect classification,
correctly classifying most of the fields, see Table 1. Precision
is 97.3% over all fields in the real estate data set (with just
24 out of 931 classified fields incorrectly classified) and
recall 97.4%. This excludes 56 (or 5.5%) fields for which our
domain schema does not contain a concept (usually as they
appear only very rarely).

Classification errors are mostly caused by ambiguity in
the used form labels. For example, Figure 21 shows a form
where the "model style" field is erroneously classified as a
MODEL field by OPAL. The field has a proper label "model
style" which is correctly assigned to the field in the field
labeling, as are the field values "4x4", "City Car", etc. In
the classification, we prioritise proper labels over values (as
value annotators are more noisy). In most cases, this is
indeed preferable, but here the proper label "model style" is
annotated with model and we classify the field as model rather
than car_type, as "model style" is not recognised as a
label for car_type. A probabilistic classification that combines
classifications from labels and values (with a lower weight)
would allow us to choose the most likely global form
classification and thus to address such outliers. However, this
would also increase the effort in creating a domain schema.




Fig. 22: Scopes (legend: field, segment, layout, domain; datasets: Real-estate, Used-car)

Fig. 23: Form integration errors (per form) (datasets: Real-Estate, Used-Car)



6.3 Contributions of Scopes 

We demonstrate the effectiveness of combining different
types of analysis by measuring to what extent each of our
four scopes contributes to the overall quality of form
understanding. We use again the two domain datasets from the
previous experiment. For both we show the results for recall,
though the picture is similar for precision and F-score, cf.
Figure 18. As illustrated in Figure 22 for the field labeling
in the real-estate dataset, the field scope already contributes
significantly (67%). The segment scope increases recall by
18%; layout scope and the repair in the form interpretation
together add another 13%. Note that the contribution of the
repair in the form interpretation is more significant than that
of the layout scope, indicating the importance of domain
knowledge to achieve very high accuracy form
understanding. In the used car domain, field scope alone is even more
significant (85%), as many of the websites use modern web
technologies and frameworks with reasonable structure.



6.4 Form Integration 

For the evaluation of the form integration, we determine the
error rate in the query translation for all forms in the used
car and real estate datasets. We use multiple master queries
in both cases, using for the real estate domain combinations
of location, min price, max price, and min bedroom. For the
used car domain, we use combinations of location, make,
model, min price, and max price. We evaluate the constraints
separately and consider a constraint correctly translated if it
involves the right field on the concrete form and uses the best
matching value. Overall, OPAL generates 95.6% and 93.8%
correctly translated constraints.

Figure 23 presents the number of web forms where OPAL
fails to translate one or more constraints. Overall,
87% of the forms were filled perfectly, and 95% of the forms
have no more than one failure. Figure 24 presents the major
causes for OPAL's failure in translating constraints: Most of
the errors are caused by scripted forms with hidden (21%)
or heavily customised form controls (24%). The remaining
cases divide rather evenly between errors in the form
labeling (17%), in the classification or annotation (incomplete
gazetteer), and an assortment of other issues, mostly browser
related (e.g., scripted popovers that block access to the form
fields or fields that can only be filled in a certain order).

Fig. 24: Types and distribution of form integration errors

6.5 Scalability

As discussed in Sections 3 and 4, overall the analysis of
OPAL is bounded by $O(n^2)$ due to the layout scope. As
expected, actual performance follows a quadratic curve, but
with very low constants. There is a significant number of
outliers, partially due to long page rendering time and
partially due to variance in the depth and sophistication of the
HTML structure. Figure 25 reports OPAL performance on
all 534 forms in the combined TEL-8 and ICQ datasets. The
highlighted area covers 80% of the forms with 2200 nodes.
OPAL requires at most 30s for the analysis (including page
rendering) of these forms. Further analysis on the effect of
increasing field or form numbers confirms that these have
little effect and page size is the dominant factor.



7 Related Work 

Form understanding has attracted a number of approaches
motivated by deep web search [20, 27, 28], meta-search engines
and web form integration fBl l 10113 ll[32l[33l l35l, and
web extraction [29, 30]. We focus here on differences to
OPAL; for a complete survey see [18, 11]. We present related
work for form understanding and form integration separately,
as not all approaches consider both aspects.



7.1 Form Understanding 

Form understanding approaches can be roughly categorised 
by the fundamental approach to the problem: 

(1) The most common type encodes (mostly domain
independent) observations on typical forms into implicit
heuristics or explicit rules: MetaQuerier [35], ExQ [32],
SchemaTree [10], LITE [27], Wise-iExtractor [15], DEQUE [28],
and CombMatch [16]. (2) Alternatively, some approaches, such as
LabelEx [23] and HMM [17], use machine learning from a set
of example forms (possibly of a specific domain). (3) Form
understanding is often done to surface the results hidden behind
the form, and approaches such as [20, 31, 27] exploit the
extracted results for form understanding.

Aside from system design, OPAL primarily differs from
these approaches in two aspects: (1) They mostly incorporate
only one or two of OPAL's scopes (and feature classes):
MetaQuerier, ExQ, and SchemaTree mostly ignore the HTML
structure (and thus field and segmentation scope) and rely on
visual heuristics only; CombMatch, LITE, DEQUE, and LabelEx
ignore field grouping; HMM ignores visual information;
[20, 31, 27] use only the HTML structure, but exploit probing
information, i.e., whether a submission is successful or not.
(2) None of the approaches provides a proper form model
classifying the form fields according to a given schema.
Furthermore, no approach uses domain knowledge
to improve the labeling or verify the classification, though
LabelEx analyses domain specific term frequencies of label
texts and HMM checks for generic terms, such as "min". As
evident in our evaluation, each of the scopes in OPAL considerably
affects the quality of the form labeling and classification.
The fact that each of these approaches omits at least
one of the domain-independent scopes explains the significant
advantage in accuracy OPAL exhibits on TEL-8 and ICQ.
Notice also that not using domain knowledge keeps these
approaches out of reach of the nearly perfect field classification
achieved by OPAL.

Form understanding by observation and heuristics. Most
closely related in spirit to OPAL, though very different in
realisation and accuracy, is MetaQuerier [35]. It is built upon
the assumption that web forms follow a "hidden syntax"
which is implicitly codified in common web design rules.
To uncover this hidden syntax, MetaQuerier treats form
understanding as a parsing problem, interpreting the page as a
sequence of "atomic visual elements", each coming with a
number of attributes, in particular with its bounding box. In a
study covering 150 forms, the authors of MetaQuerier identified
21 common design patterns. These patterns are captured
by production rules in a grammar with preferences. MetaQuerier
is not parameterisable for a specific domain. In contrast,
the domain independent part of OPAL achieves nearly perfect
accuracy with only 6 generic patterns by combining visual,
structural, and textual features, and a simple prioritisation
of these patterns by scope. OPAL's domain dependent
part allows us to adjust it for patterns specific to a domain.

ExQ [32] is similarly based primarily on visual features,
such as a bias for labels located to the top-left, comparable to
OPAL, but disregards most structural clues, such as explicit
for attributes of label tags, and does not allow for any domain
specific patterns.

SchemaTree [10] also uses only visual features (and the
tabindex and for attributes for fields and labels). It exploits
nine observations on form design, e.g., that query interfaces
are organised top-down and left-to-right or that fields form
semantic groups. It uses a hierarchical alignment between
fields and text nodes similar to OPAL's segment scope and
a "schema tree" in which the nine observations are respected.
Again, no adaptation to a specific domain is possible.

Wise-iExtractor [15] first tokenizes the form to obtain
a high-level visual layout description (an interface expression
(IEXP)), distinguishing text fragments, form fields,
and delimiters, such as line breaks. It then associates texts
and fields by computing the association weight between any
given field and the texts in the same line and the two preceding
lines, exploiting ending colons, similarities between the
text and the field's HTML name attribute, and the text-field
distance. In addition, Wise also identifies generic relationships
between fields: range (e.g. from, to), part (e.g. first
and last name), group (e.g. radio buttons), or constraint
(e.g. exact match required). However, in contrast to OPAL,
their form labeling only explores limited visual and textual
information, relying mainly on weight computation. Moreover,
their domain-independent typing shares some similarities
with OPAL's templates but lacks the flexibility provided
by OPAL's domain schemata that allow us to adjust these
generic types to a given domain. Though these adjustments
are often small, their impact is significant, as shown in
Section 6.
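
The text-field association described for Wise-iExtractor can be pictured with the following sketch. It is not the implementation of [15]: the `Text` and `Field` types, the numeric weights, and the scoring formula are hypothetical and merely illustrate how an ending colon, similarity to the name attribute, and distance might be combined for texts in the same line or the two preceding lines.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Text:
    content: str
    line: int      # visual line of the text in the form layout
    column: int    # horizontal position within that line


@dataclass
class Field:
    name: str      # value of the HTML name attribute
    line: int
    column: int


def association_weight(field: Field, text: Text) -> float:
    """Score how likely `text` labels `field` (all weights are hypothetical)."""
    line_gap = field.line - text.line
    # Only texts in the same line or the two preceding lines are candidates.
    if line_gap < 0 or line_gap > 2:
        return 0.0
    weight = 0.0
    if text.content.rstrip().endswith(":"):
        weight += 1.0                      # an ending colon hints at a label
    similarity = SequenceMatcher(
        None, text.content.lower().strip(" :"), field.name.lower()
    ).ratio()
    weight += 2.0 * similarity             # text resembles the name attribute
    distance = 3 * line_gap + abs(field.column - text.column)
    weight += 1.0 / (1.0 + distance)       # closer texts score higher
    return weight


def best_label(field: Field, texts: List[Text]) -> Optional[Text]:
    """Pick the candidate text with the highest positive weight, if any."""
    scored = [(association_weight(field, t), t) for t in texts]
    scored = [(w, t) for w, t in scored if w > 0]
    return max(scored, key=lambda pair: pair[0])[1] if scored else None
```

A purely weight-based association of this kind has no notion of field groups or domain types, which is the limitation the comparison with OPAL points at.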

In [34], a (manually derived) domain schema is used to
guide understanding. In contrast to OPAL, it segments a form 
purely based on the domain schema (called schema tree). 
They evaluate on a fragment (around 100-150 forms) of 
TEL-8 using domain schemata derived from the rest of TEL- 
8 (about 250 forms). This yields on the considered fragment 
similar accuracy as OPAL achieves on the full TEL-8, yet 
OPAL does not use any domain schema in this case, let alone 
domain schemata specifically trained on TEL-8. 

Form understanding by learning from example forms. 
Where the above approaches rely on humans to derive 
heuristics and rules for form understanding, the following 
approaches use machine learning on a set of example forms. 
Therefore, they can also be trivially adapted to a specific domain
by using domain-specific training data. The evaluation
in [17], however, shows little effect of domain-specific training
data: a training set from the biological domain outperforms
domain-specific training sets in four out of five other
domains.

LabelEx [23] uses limited domain knowledge when considering
the occurrence frequencies of label terms. Domain
relevance of the terms occurring in a label, measured as the
occurrence frequency in previous forms, is one feature used
to score field-label candidates. Field-label candidates are
otherwise created primarily using neighbourhood and other
visual features, as well as their HTML markup. However,
LabelEx does not consider field groups and thus is unable to
describe segments of semantically related fields or to align
fields and labels based on the group structure. It also does not
use any domain knowledge aside from term frequency.

HMM [17] uses predefined knowledge on typical terms
in forms, such as "between", "min", or "max", but does not
adapt these for a specific domain. HMM employs two hidden
Markov models to model an "artificial web designer". During
form analysis, the HMMs are used to explain the phenomena
observed on the page: the state sequences that are
most likely to produce the given web form are considered
explanations of the form. Compared to OPAL, HMM uses no
visual features and no domain knowledge.
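
Finding the state sequence most likely to have produced an observed sequence is the classic decoding problem for hidden Markov models, commonly solved with the Viterbi algorithm. The sketch below is a generic Viterbi decoder over a single toy model; the states (`label`, `field`), the observation symbols, and all probabilities are invented for illustration and are not taken from [17].

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] holds the probability and path of the best sequence ending in s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Choose the predecessor that maximises the probability of reaching s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda candidate: candidate[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])


# Toy model (all numbers are assumptions): a designer alternates between
# emitting label text and input controls.
states = ["label", "field"]
start_p = {"label": 0.8, "field": 0.2}
trans_p = {
    "label": {"label": 0.2, "field": 0.8},
    "field": {"label": 0.6, "field": 0.4},
}
emit_p = {
    "label": {"text": 0.9, "textbox": 0.05, "select": 0.05},
    "field": {"text": 0.1, "textbox": 0.5, "select": 0.4},
}

observed = ["text", "textbox", "text", "select"]
probability, explanation = viterbi(observed, states, start_p, trans_p, emit_p)
print(explanation)
```

In this toy setting the decoded path alternates between label and field states, i.e., each text node is explained as the label of the control that follows it.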

Form understanding by probing. All the above approaches 
conduct their analysis based purely on information avail- 
able on the web forms. Alternatively, there is also an indi- 
rect route for form understanding by submitting the forms 
and analysing the query results, which often are much eas- 
ier to classify (as there are many instances compared to a 
single form). The price is, however, that a certain amount of
analysis of those result pages is necessary. Therefore, this is
primarily used in a context where such analysis is required
anyway, e.g., in crawlers or data extraction systems. Typically,
these approaches proceed incrementally, identifying
inputs for some fields, submitting the form, analysing
the result page, and then possibly restarting the whole process,
now with, e.g., an increased set of input values for the
form. For example, [20] determines whether a field must
be filled or is a "free" input by iterating over possible tem- 
plates and selecting those that return sufficiently distinct re- 
sult pages. This is driven by the desire to surface some rep- 
resentative, but not necessarily complete set of results from 
the web form. None of these approaches produces a sophis- 
ticated form model, but at best rough classifications of the 
fields and whether they are mandatory. 
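
The idea of iterating over templates and keeping those that return sufficiently distinct result pages can be sketched as follows. This is a simplified illustration under our own assumptions, not the procedure of [20]: a template is taken to be a subset of fields to bind, `submit` is a hypothetical function returning a signature of the result page, and the `threshold` on the fraction of distinct signatures is arbitrary.

```python
from itertools import combinations, product


def informative_templates(fields, values, submit, max_size=2, threshold=0.75):
    """Return field subsets whose submissions yield sufficiently distinct pages.

    `submit` maps a dict of field bindings to a result-page signature.
    A template is kept if the fraction of distinct signatures among its
    submissions reaches `threshold`.
    """
    kept = []
    for size in range(1, max_size + 1):
        for template in combinations(fields, size):
            bindings = [
                dict(zip(template, combo))
                for combo in product(*(values[f] for f in template))
            ]
            signatures = {submit(b) for b in bindings}
            if bindings and len(signatures) / len(bindings) >= threshold:
                kept.append(template)
    return kept


# Hypothetical stand-in for a real form submission: only `make` influences
# the result page, while `sort` is ignored by this fake site.
def fake_submit(binding):
    return ("results-for", binding.get("make"))


fields = ["make", "sort"]
values = {"make": ["audi", "bmw", "fiat"], "sort": ["asc", "desc"]}
print(informative_templates(fields, values, fake_submit))
```

Here only the template binding `make` survives, since varying `sort` never changes the page signature. Such a procedure tells us which fields matter for surfacing results, but not what they mean.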



7.2 Form Filling and Integration 

Form integration has been considered in many shapes, either
as "meta-search" where a master query on a given global
schema is translated to concrete forms as in OPAL, as "interface
matching" where many concrete forms are integrated
without a global schema (often using schema matching), or
as "query generation" in the context of data extraction or
crawling where the aim is to generate a set of queries to extract
all or most of the data, but often not even full form
understanding is performed.

Though some query generation and most interface
matching approaches use form understanding, they are focused
on different issues than OPAL's form integration,
which is a type of "meta-search": how to find an optimal
query set that uncovers as much deep content as possible [3],
how to determine if a query will produce relevant
data even if only partial information about the data is available [5],
how to maximize the diversity of the extracted content [20],
or how to identify semantic equivalences between
fields from different forms [M].

Similar to OPAL, [1] fills web forms by connecting
fields at the conceptual level, but with WordNet [26] instead
of proper annotations. Furthermore, OPAL produces a
more structured form model that is verified against a domain
schema. MetaQuerier [9] targets the integration of web
sources and tackles query translation for form filling in that
context. Like OPAL, MetaQuerier selects values closest to the
constraint in the source query (similar to our master query).
They also perform type-based query translation to map a
source query to a target query considering numeric and text
types, but achieve only 87% accuracy. OPAL performs form
filling in a similar fashion, but also considers the number of
fields for each domain type in the master query and performs
significantly better (93%).
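
Selecting the value closest to a constraint, separately for numeric and text types, can be illustrated with the sketch below. It covers only this closest-value step and is not OPAL's or MetaQuerier's actual translation procedure; the function names, the dictionary-based form model, and the use of string similarity for text values are assumptions made for the example.

```python
from difflib import SequenceMatcher


def closest_numeric(target, options):
    """Pick the option with the smallest absolute distance to `target`."""
    return min(options, key=lambda option: abs(option - target))


def closest_text(target, options):
    """Pick the option most similar to `target`, ignoring case."""
    return max(
        options,
        key=lambda option: SequenceMatcher(None, target.lower(), option.lower()).ratio(),
    )


def translate(master_query, form_fields):
    """Map each master-query constraint to a value offered by the form."""
    filled = {}
    for field_type, wanted in master_query.items():
        options = form_fields.get(field_type)
        if not options:
            continue  # the form has no field of this type
        pick = closest_numeric if isinstance(wanted, (int, float)) else closest_text
        filled[field_type] = pick(wanted, options)
    return filled


master_query = {"make": "BMW", "min_price": 12000, "max_price": 21000}
form_fields = {
    "make": ["Audi", "bmw", "Fiat"],
    "min_price": [5000, 10000, 15000],
    "max_price": [10000, 20000, 30000],
}
print(translate(master_query, form_fields))
```

A translated constraint would then count as correct in the sense of Section 6.4 if it lands on the right field and uses the best matching value.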



8 Conclusion and Future Work 

To the best of our knowledge, OPAL is the first comprehensive
approach to form understanding and integration. Previous
form understanding approaches have been limited mainly
by overly generic, domain independent, monolithic algorithms
relying on narrow feature sets. OPAL pushes the state
of the art significantly, addressing these limitations through
a very accurate domain independent form labeling, exploiting
visual, textual, and structural features, by itself already
outperforming existing approaches. This domain independent
part is complemented with a domain dependent form
field classification that significantly improves the overall
quality of the form understanding and produces nearly perfect
form interpretations. Accurate form interpretations enable
form integration: OPAL successfully realizes a light-weight
form integration system, able to translate master
queries to forms of a domain with nearly no errors.

Nevertheless, there remain open issues in OPAL and form 
understanding in general that need to be addressed for form 
understanding to become a reliable tool to access web data 
through forms with little more effort than through APIs: 

(1) Dynamic, scripted forms: OPAL is able to under- 
stand most static forms with near perfect accuracy, but per- 
forms much worse on dynamic forms. We are already work- 
ing on an extension of OPAL for dealing with dynamic, heav- 
ily scripted interfaces that combines ideas from state explo- 
ration and crawling with form understanding. 

(2) Customised form widgets: More and more forms 
use customised widgets such as tree views or sliders. 
Though most of these cases use hidden form fields that can 
be analysed by OPAL, the use of fully scripted cases in- 
creases. We plan to extend OPAL to allow the customisation 
of the form widgets that it can recognise. However, if these 
cases become more common, it may become necessary to 
automatically explore and learn such new widget types. 

(3) Probing-based understanding: One of OPAL's
virtues is that it achieves its near perfect accuracy without
any probing, that is, from the form page alone. However,
this also limits the information that OPAL can provide,
and prevents the verification and repair of the form model
through the results returned by a form submission. For
applications that need to access the result pages (e.g., data
extraction and surfacing), we plan to integrate OPAL with the
result page analysis system AMBER [13] to further improve
accuracy and to address integrity and access constraints.

(4) Integrity and access constraints: OPAL produces
some integrity constraints through the domain schema and
its form segmentation, e.g., dependencies between min and
max fields in a range segment. We see an increase in the use
of integrity constraints in forms thanks to the availability
of easy-to-use client-side validation libraries. Light-weight
methods for analysing and exploiting such client-side validation
would allow us to extend our form models with more
detailed integrity constraints. This is in addition to integrity
and access constraints derived from probing.
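
A dependency between the min and max fields of a range segment, as mentioned in item (4), might be checked as in the following sketch. The `RangeSegment` class and its `violations` method are hypothetical and only illustrate the kind of integrity constraint meant; they do not reflect OPAL's form model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RangeSegment:
    """A pair of fields bounding one attribute, e.g. min and max price."""
    attribute: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def violations(self) -> List[str]:
        """Return descriptions of integrity violations for the current values."""
        problems = []
        # The constraint only applies when both bounds are filled in.
        if self.minimum is not None and self.maximum is not None:
            if self.minimum > self.maximum:
                problems.append(
                    f"{self.attribute}: min {self.minimum} exceeds max {self.maximum}"
                )
        return problems


print(RangeSegment("price", minimum=20000, maximum=10000).violations())
```

Checking such constraints before submission avoids filling a form with values that its client-side validation would reject.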



9 Acknowledgements 

The research leading to these results has received funding 
from the European Research Council under the European 
Community's Seventh Framework Programme (FP7/2007- 
2013) / ERC grant agreement DIADEM, no. 246858. Gior- 
gio Orsi has also been supported by the Oxford Martin 
School's grant no. LC0910-019. 



References 

1. S. Araujo, Q. Gao, E. Leonardi, and G.-J. Houben. Carbon: 
domain-independent automatic web form filling. In Proc. Int'l. 
Conf. on Web Engineering (ICWE), pages 292-306, 2010. 

2. Z. Bar-Yossef and M. Gurevich. Random sampling from a search 
engine's index. J. ACM, 55(5):24:1-24:74, 2008.

3. L. Barbosa and J. Freire. Siphoning hidden-web data through
keyword-based interfaces. In Proc. Brazilian Symp. on Database, 
pages 309-321, 2004. 

4. L. Barbosa and J. Freire. Combining classifiers to identify online 
databases. In Proc. Int'l. World Wide Web Conf. (WWW), pages 
431-440, 2007.

5. M. Benedikt, G. Gottlob, and P. Senellart. Determining relevance 
of accesses at runtime. In Proc. Symp. on Principles of Database 
Systems (PODS), pages 211-222, 2011. 

6. M. Benedikt and C. Koch. XPath leashed. ACM Computing Sur- 
veys, pages 3:1-3:54, 2007. 

7. A. Bilke and F. Naumann. Schema matching using duplicates. 
In Proc. Int'l. Conf. on Data Engineering (ICDE), pages 69-80, 
2005. 

8. M. J. Cafarella, E. Y. Chang, A. Fikes, A. Y. Halevy, W. C. Hsieh, 
A. Lerner, J. Madhavan, and S. Muthukrishnan. Data management 
projects at google. Sigmod Records, 37(1):34-38, 2008.

9. K. C.-C. Chang, B. He, and Z. Zhang. Mining semantics for large 
scale integration on the web: evidences, insights, and challenges. 
SIGKDD Explor. Newsl., 6(2):67-76, Dec. 2004.

10. E. C. Dragut, T. Kabisch, C. Yu, and U. Leser. A hierarchical ap- 
proach to model web query interfaces for web source integration. 
In Proc. Int'l. Conf. on Very Large Data Bases (VLDB), pages 
325-336, 2009. 

11. E. C. Dragut, W. Meng, and C. T. Yu. Deep Web Query Inter- 
face Understanding and Integration. Synthesis Lectures on Data 
Management. Morgan & Claypool Publishers, 2012. 

12. T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schall- 
hart. Opal: automated form understanding for the deep web. In 
Proc. Int'l. World Wide Web Conf. (WWW), pages 829-838, 2012.

13. T. Furche, G. Gottlob, G. Grasso, G. Orsi, C. Schallhart, and 
C. Wang. Little Knowledge Rules The Web: Domain-Centric Re- 
sult Page Extraction. In Proc. Int'l. Conf. on Web Reasoning and 
Rule Systems (RR), pages 61-76, 2011. 

14. B. He, Z. Zhang, and K. C.-C. Chang. Towards building a meta- 
querier: Extracting and matching web query interfaces. In Proc. 
Int'l. Conf. on Data Engineering (ICDE), pages 1098-1099, 2005. 

15. H. He, W. Meng, Y. Lu, C. Yu, and Z. Wu. Towards deeper under- 
standing of the search interfaces of the deep web. World Wide Web,
10:133-155, 2007. 



16. O. Kaljuvee, O. Buyukkokten, H. Garcia-Molina, and A. Paepcke. 
Efficient web form entry on pdas. In Proc. Int'l. World Wide Web 
Conf. (WWW), pages 663-672, 2001.

17. R. Khare and Y. An. An empirical study on using hidden markov 
model for search interface segmentation. In Proc. Int'l. Conf. on 
Information and Knowledge Management (CIKM), pages 17-26, 
2009. 

18. R. Khare, Y. An, and I.-Y. Song. Understanding deep web search 
interfaces: A survey. Sigmod Records, 39(1):33-40, 2010.

19. J. Lehmann, T. Furche, G. Grasso, A.-C. N. Ngomo, C. Schall- 
hart, A. Sellers, C. Unger, L. Buhmann, D. Gerber, D. L. Konrad 
Hoffner and, and S. Auer. Deqa: Deep web extraction for question 
answering. In Proc. Int'l. Semantic Web Conf. (ISWC), 2012. 

20. J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and 
A. Halevy. Google's deep web crawl. In Proc. Int'l. Conf. on Very 
Large Data Bases (VLDB), pages 1241-1252, 2008. 

21. A. Maiti, A. Dasgupta, N. Zhang, and G. Das. Hdsampler: reveal- 
ing data behind web form interfaces. In Proc. Symp. on Manage- 
ment of Data (SIGMOD), pages 1131-1134, 2009. 

22. I. Navarrete, A. Morales, M. Cardenas, and G. Sciavicco. Spatial 
reasoning with rectangular cardinal relations - the convex tractable 
subalgebra. In Annals of Mathematics and Artificial Intelligence, 
2012. 

23. H. Nguyen, T. Nguyen, and J. Freire. Learning to extract form 
labels. In Proc. Int'l. Conf. on Very Large Data Bases (VLDB), 
pages 684-694, 2008. 

24. T. H. Nguyen, H. Nguyen, and J. Freire. PruSM: a prudent schema 
matching approach for web forms. In Proc. Int'l. Conf. on Infor- 
mation and Knowledge Management (CIKM), pages 1385-1388, 
2010. 

25. F. Niu, C. Zhang, C. Re, and J. Shavlik. DeepDive: Web-scale 
knowledge-base construction using statistical learning and infer- 
ence. In Proc. Very Large Data Search (VLDS), pages 25-28, 
2012. 

26. T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity
- measuring the relatedness of concepts. In Proc. HLT-NAACL-
Demonstrations, pages 38-41, 2004. 

27. S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In 
Proc. Int'l. Conf. on Very Large Data Bases (VLDB), pages 129- 
138, 2001. 

28. D. Shestakov, S. Bhowmick, and E. Lim. Deque: querying the 
deep web. Data & Knowledge Engineering (DKE), 52(3):273- 
311, 2005.

29. W. Su, J. Wang, and F. H. Lochovsky. ODE: Ontology- 
assisted data extraction. ACM Transactions on Database Systems, 
34(2):12:1-12:35, 2009. 

30. J. Wang and F. H. Lochovsky. Data extraction and label assign- 
ment for web databases. In Proc. Int'l. World Wide Web Conf.
(WWW), pages 187-196, 2003. 

31. J. Wang, J.-R. Wen, F. Lochovsky, and W.-Y. Ma. Instance-based 
schema matching for web databases by domain-specific query 
probing. In Proc. Int'l. Conf. on Very Large Data Bases (VLDB), 
pages 408-419, 2004.

32. W. Wu, A. Doan, C. Yu, and W. Meng. Modeling and extracting 
deep-web query interfaces. In Advances in Information & Intelli-
gent Systems, pages 65-90, 2009. 

33. W. Wu, C. T. Yu, A. Doan, and W. Meng. An interactive 
clustering-based approach to integrating source query interfaces 
on the deep web. In Proc. Symp. on Management of Data (SIG- 
MOD), pages 95-106, 2004. 

34. X. Yuan, H. Zhang, Z. Yang, and Y. Wen. Understanding the 
search interfaces of the deep web based on domain model. In Proc. 
Int'l Conf. on Computer and Information Science, pages 1194- 
1199, 2009. 

35. Z. Zhang, B. He, and K. C.-C. Chang. Understanding web query 
interfaces: Best-effort parsing with hidden syntax. In Proc. Symp. 
on Management of Data (SIGMOD), 2004. 



